Author: ELLIOTTCABLE

Description:
Convert between JavaScript UCS-2-encoded strings and OCaml-friendly UTF-8 byte-arrays.
Language: TypeScript
Repository: git://github.com/ELLIOTTCABLE/ocaml-string-convert.git
Created: 2019-08-16T03:55:18Z
Project page: https://github.com/ELLIOTTCABLE/ocaml-string-convert

ocaml-string-convert

This is a library to shim BuckleScript’s string-handling when using native-OCaml string-manipulation libraries.

Background

When using [BuckleScript][] to compile [OCaml][] source-code to JavaScript, no attempt is made to handle the runtime conversion of string values between the two semantic systems.

In particular, a String value in JavaScript is (basically) a UCS-2 character array. The closest you can get to “give me the thingie X thingies into the length of the string” is String::charCodeAt; specifically, this function returns the Nth UTF-16 code unit of the string.
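A minimal TypeScript sketch of what “code unit” means in practice. The helper name `codeUnits` is made up for illustration and is not part of any library:

```typescript
// Hypothetical helper: list the UTF-16 code units that
// String.prototype.charCodeAt exposes for a string.
function codeUnits(s: string): number[] {
  const units: number[] = []
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i))
  }
  return units
}

const ascii = codeUnits("h")          // one unit, inside the ASCII range
const bmp = codeUnits("\u062C")       // ARABIC LETTER JEEM: one unit, outside ASCII
const astral = codeUnits("\u{1F600}") // outside the BMP: a surrogate pair, two units

console.log(ascii, bmp, astral)
```

Characters inside the BMP occupy one code unit each; anything above U+FFFF shows up as two.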

Meanwhile, over in OCaml land, the type string (and the functions in the module String) is, semantically, a dumb byte array. That is, when you ask the OCaml compiler for a_string.[0], you don’t get the first character of the string, or even a Unicode-aware codepoint or grapheme; instead, you get the first byte of (what OCaml believes to be) a series of opaque bytes.

Unfortunately, BuckleScript compiles the latter syntax (a_string.[0]) into the former semantic (a_string.charCodeAt(0)); this only makes sense within the very limited range of the ASCII-compatible bytes; that is, between 0-127.
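One way to see where the two semantics agree is to check whether a string stays inside that range. The sketch below uses a hypothetical helper, `isAsciiOnly`, that is not provided by BuckleScript or by this library:

```typescript
// Hypothetical helper: true when every code unit is in the ASCII range (0-127),
// where code-unit indexing and UTF-8 byte indexing give the same answers.
function isAsciiOnly(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 127) return false
  }
  return true
}

const plain = isAsciiOnly("hello")           // every unit is below 128
const withDot = isAsciiOnly("foo\u00B7bar")  // U+00B7 is 183, outside ASCII

console.log(plain, withDot)
```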

Let’s experiment with the following small program. It’ll take an input string on the command-line, extract the first … character? byte? and then tell us about it.

    (* str_test.ml *)
    let first_char_info s =
      let c = s.[0] in
      "Code: " ^ string_of_int (Char.code c) |> print_endline;
      "String: " ^ String.make 1 c |> print_endline

    (* Change the "1" to a "2" to execute this with Node.js. Annoyingly. *)
    let () = first_char_info Sys.argv.(1)

The above works, both when compiled via the traditional OCaml toolchain, and when compiled to JavaScript and executed with Node.js … but only when the entire string is within the ASCII range:

    $ bsc str_test.ml
    $ node str_test.js hello
    Code: 104
    String: h
    $ ocaml str_test.ml hello
    Code: 104
    String: h

Let’s try the same thing with a non-ASCII, international string:

    $ node str_test.js جمل
    Code: 1580
    String: ج
    $ ocaml str_test.ml جمل
    Code: 216
    String: ?

Ruh-roh. The problem here comes from this series of exchanges:

  1. The value s in the above program comes in as a UTF-8 encoded string; that’s what the shell is passing along to the program in Sys.argv.

  2. Node.js understands and expects this; and converts the incoming value into its internal format, UCS-2; this means that s.charCodeAt(0) is going to be the first UCS code-point of that input string as encoded in UCS-2. That is to say, "ج", integer value 1580.

  3. An OCaml program, unaware that it’s being compiled via BuckleScript, expects string values arising from UTF-8 input (like s) to be addressed bytewise; that is, they’d expect s.[0] to yield “\xD8” (216) and s.[1] to yield “\xAC” (172), the two bytes of the UTF-8 encoding of the codepoint ‘ج’.
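The mismatch between steps 2 and 3 can be reproduced from the JavaScript side alone. This sketch assumes a runtime with a global `TextEncoder` (a recent Node.js or a browser):

```typescript
const s = "\u062C\u0645\u0644" // "جمل"

// What BuckleScript-compiled `s.[0]` actually evaluates: a UTF-16 code unit.
const firstCodeUnit = s.charCodeAt(0)

// What native OCaml sees for the same input: the UTF-8 bytes.
const utf8 = new TextEncoder().encode(s)
const firstByte = utf8[0]
const secondByte = utf8[1]

console.log(firstCodeUnit, firstByte, secondByte)
// The lengths disagree too: three code units versus six bytes.
console.log(s.length, utf8.length)
```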

tl;dr OCaml libraries expecting to operate on UTF-8 byte-arrays (like Sedlex, [Menhir][], [Camomile][], or any of Daniel Bünzli’s Unicode-handling libraries) are going to break when compiled to JavaScript via BuckleScript and fed actual UTF-8 input.

[BuckleScript]: https://bucklescript.github.io
[OCaml]: https://ocaml.org

[Menhir]: http://gallium.inria.fr/~fpottier/menhir
[Camomile]: https://github.com/yoriyuki/Camomile

Solution

This library provides a shim for this behaviour. Unicode input to a JavaScript program can be fed through the functions provided by this library, which uses the TextEncoder and TextDecoder APIs (or the fast-text-encoding npm module as a shim therefor) to transform the UCS-2 strings being passed around by JavaScript systems, into TypedArrays of UTF-8 bytes. These UTF-8 values will then be copied back into (now malformed, but predictably-malformed) JavaScript Strings; these can be passed with impunity to UTF-8 handling OCaml functions, which will now function as expected.
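A minimal sketch of that transformation, assuming a global `TextEncoder`. The name `encodeSketch` is hypothetical, and this is not necessarily how the library implements it:

```typescript
// Hypothetical illustration: re-pack UTF-8 bytes as one code unit per byte.
function encodeSketch(str: string): string {
  const bytes = new TextEncoder().encode(str) // Uint8Array of UTF-8 bytes
  let out = ""
  bytes.forEach((b) => {
    // Each byte (0-255) becomes its own 16-bit code unit.
    out += String.fromCharCode(b)
  })
  return out
}

const fake = encodeSketch("\u062C") // "ج"
// Code-unit indexing on the result now yields the UTF-8 bytes.
console.log(fake.length, fake.charCodeAt(0), fake.charCodeAt(1))
```

After this step, BuckleScript's `charCodeAt`-based indexing and native OCaml's byte indexing see the same numbers.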

Note: This package is not necessary for code written specifically for BuckleScript; just be aware of the BuckleScript-specific semantics of the .[] string-indexing operator. This package is only necessary if you’re either A. writing a library that’s intended to be used by both native projects and JavaScript projects, or B. using a native-targeting library from [opam][] and compiling it to JavaScript.

[opam]: https://opam.ocaml.org

Usage

Install ocaml-string-convert with npm:

    npm install --save ocaml-string-convert

Include it on the JavaScript side of your project:

    import {
      toFakeUTF8String,
      fromFakeUTF8String
    } from 'ocaml-string-convert'

toFakeUTF8String(str)

This function is intended to be called on JavaScript strings (possibly containing Unicode characters outside the ASCII range) that need to be passed to OCaml functions; it ‘double-encodes’ those strings such that they will be perceived by BuckleScript-compiled OCaml as UTF-8-encoded char-arrays.

Input

This function takes one argument, a ‘standard’ JavaScript String; that is, one with Unicode characters outside the ASCII range (but still within the BMP!) encoded as single, 16-bit code-units; and higher-plane characters encoded as UTF-16-style surrogate pairs.

  • Example, as a UCS-2 sequence of 16-bit code-units:

        [102, 111, 111, 183, 98, 97, 114]

  • Example, as typed into a UTF-8 JavaScript source-file:

        "foo·bar"

Output

An abomination. This produces a JavaScript String (that is still technically encoded as UCS-2, mind you!) containing a series of UTF-8 bytes, each stored as its own UCS-2 code unit.

  • Example, as a UCS-2 sequence of 16-bit code-units:

        [102, 111, 111, 194, 183, 98, 97, 114]

  • Example, as typed into a UTF-8 JavaScript source-file:

        "foo\xC2\xB7bar" // or "fooÂ·bar", if you're a heathen

See that, in this example, the non-ASCII character U+00B7 “MIDDLE DOT”, which is one code-unit (literally \u00B7) in the original input-string, is encoded as two JavaScript / UCS-2 code-units, \xC2\xB7 — C2-B7 being the UTF-8 encoding of U+00B7.

fromFakeUTF8String(str)

The inverse operation to the above.

Given a double-encoded (effectively, mis-encoded) BuckleScript ‘string’ that’s been manipulated as if it’s a UTF-8 char-array, this function will decode (effectively, re-encode) that value into a functional, correct JavaScript (i.e. UCS-2) string.

Takes a String, containing a series of UTF-8 bytes encoded as Unicode codepoints (in JavaScript’s standard UCS-2, that is); returns a standard JavaScript String with those Unicode scalars properly represented in UCS-2 code units, ready for standard JavaScript manipulation.
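The inverse direction can be sketched the same way, assuming a global `TextDecoder`. `decodeSketch` is a hypothetical name, not the library's implementation:

```typescript
// Hypothetical illustration: treat each code unit as one UTF-8 byte, then decode.
function decodeSketch(fake: string): string {
  const bytes = new Uint8Array(fake.length)
  for (let i = 0; i < fake.length; i++) {
    // Assumes every code unit is a byte value (0-255);
    // Uint8Array would wrap anything larger modulo 256.
    bytes[i] = fake.charCodeAt(i)
  }
  return new TextDecoder("utf-8").decode(bytes)
}

const restored = decodeSketch("foo\xC2\xB7bar")
console.log(restored.length, restored.charCodeAt(3))
```

Here the two code units 194 and 183 collapse back into the single code unit 183 (U+00B7).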

A Note on Types

Given that readers of this are almost guaranteed to write OCaml, it will probably surprise nobody that I prefer the ability to use nominal types. This is not, however, standard TypeScript practice.

This library’s TypeScript interface (which I hope I’m exporting correctly, by the way; I’m rather new to publishing a TypeScript-enabled library!) mints a new type for string_as_utf_8_buffer. Idiomatic usage would be to tag every stringish return-value from a BuckleScript module with this type:

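    // string_as_utf_8_buffer is assumed here to be exported by the package.
    import { toFakeUTF8String, fromFakeUTF8String, string_as_utf_8_buffer } from 'ocaml-string-convert'
    import $AModule from './aModule.bs'

    // Tag the opaque BuckleScript value as soon as it crosses into TypeScript.
    let $yuck = $AModule.returns_a_string() as string_as_utf_8_buffer
    // ... manipulation ...
    let str = fromFakeUTF8String($yuck)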

(As you can see, I also like to follow a different naming-convention for values I know to contain opaque values produced by the BuckleScript runtime.)
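For readers unfamiliar with the technique: TypeScript is structurally typed, so nominal typing is usually emulated with a “brand”. The sketch below shows the general pattern with hypothetical names (`FakeUtf8`, `tagAsFakeUtf8`, `byteLength`); it is not the library's actual definition of string_as_utf_8_buffer.

```typescript
// A string carrying a phantom brand. The property never exists at runtime.
type FakeUtf8 = string & { readonly __brand: "FakeUtf8" }

// The only sanctioned way to mint the branded type in this sketch.
function tagAsFakeUtf8(s: string): FakeUtf8 {
  return s as FakeUtf8
}

// Accepts only branded values; a plain string is rejected at compile time.
function byteLength(s: FakeUtf8): number {
  return s.length // one code unit per UTF-8 byte
}

const tagged = tagAsFakeUtf8("foo\xC2\xB7bar")
const n = byteLength(tagged)
// byteLength("foo") // would not type-check: a plain string lacks the brand

console.log(n)
```

The cast costs nothing at runtime; the benefit is that a double-encoded string cannot be handed to ordinary string-handling code by accident.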

You can, of course, dispense with my convention at your earliest convenience, if you can’t stand the (hopefully helpful?) type-errors that this produces; I do not, of course, suggest that you do so:

    // string_as_utf_8_buffer is assumed here to be exported by the package.
    import { toFakeUTF8String, fromFakeUTF8String, string_as_utf_8_buffer } from 'ocaml-string-convert'
    import $AModule from './aModule.bs'

    // Wrapper that hides the nominal type behind a plain string signature.
    function from(str: string): string {
      return fromFakeUTF8String(str as string_as_utf_8_buffer)
    }

    let $yuck = $AModule.returns_a_string()
    // ... manipulation ...
    let str = from($yuck)