Class UTF8

    public class UTF8
    
                        

    Utilities for working with UTF-8 encodings.

    Decoding of UTF-8 is based on a presentation given by Bob Steagall at CppCon 2018 (see https://github.com/BobSteagall/CppCon2018). It uses a deterministic finite automaton (DFA) to recognize and decode multi-byte code points.

    • Constructor Summary

      Constructors 
      Constructor Description
      UTF8()
    • Method Summary

      Modifier and Type Method Description
      static int transcodeToUTF16(Array<byte> utf8, Array<char> utf16) Transcode a UTF-8 encoding into a UTF-16 representation.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • UTF8

        UTF8()
    • Method Detail

      • transcodeToUTF16

         static int transcodeToUTF16(Array<byte> utf8, Array<char> utf16)

        Transcode a UTF-8 encoding into a UTF-16 representation. To handle arbitrary input, the output utf16 array should be at least as long as the input utf8 array: every 1-, 2- or 3-byte UTF-8 sequence decodes to one UTF-16 code unit and every 4-byte sequence to two, so the output never needs more code units than the input has bytes. The method returns the number of UTF-16 code units written, or -1 if any error is encountered, in which case an arbitrary amount of data may already have been written to the output array. The errors detected are malformed UTF-8 (including incomplete, truncated or "overlong" encodings) and unmappable code points; in particular, no unmatched surrogates will be produced. An error also results if utf16 is too small to hold the complete output.
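        The sizing guidance above can be checked with the standard JDK alone. The following is a minimal sketch, not part of this class: the class name Utf16SizingSketch and its helper utf8Length are hypothetical, and it relies only on String.getBytes and String.length to compare UTF-8 byte counts with UTF-16 code unit counts for one sample of each sequence length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of UTF8).
final class Utf16SizingSketch {

    // Number of bytes in the UTF-8 encoding of s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per UTF-8 sequence length: 1, 2, 3 and 4 bytes.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            // String.length() counts UTF-16 code units, not code points.
            System.out.println(utf8Length(s) + " UTF-8 byte(s) -> "
                    + s.length() + " UTF-16 code unit(s)");
        }
    }
}
```

        In every case the UTF-16 length is no larger than the UTF-8 length, which is why allocating a char array of utf8.length elements is sufficient for any input.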

        Parameters:
        utf8 - A non-null array containing the UTF-8 encoding to transcode. It is expected to be well-formed; malformed input results in a return value of -1.
        utf16 - A non-null array, at least as long as the utf8 array in order to ensure the output will fit.
        Returns:

        The number of UTF-16 code units written to utf16 (beginning from index 0), or -1 if the input is malformed, encodes any unmappable code point, or does not fit because utf16 is too small.
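
        To make the documented contract concrete, here is a minimal self-contained sketch of a transcoder with the same inputs, return value and error cases. It is an illustration only: the class Utf8ContractSketch and its method transcode are hypothetical stand-ins, and the code uses plain branching rather than the DFA that this class is described as using. With the real class, the call would be transcodeToUTF16(utf8, utf16) instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in that mirrors the documented contract; NOT the DFA-based implementation.
final class Utf8ContractSketch {

    // Returns the number of UTF-16 code units written, or -1 on any error.
    static int transcode(byte[] utf8, char[] utf16) {
        int in = 0, out = 0;
        while (in < utf8.length) {
            int b = utf8[in++] & 0xFF;
            int need, cp, min; // continuation bytes, code point so far, smallest legal value
            if (b < 0x80) { need = 0; cp = b; min = 0; }
            else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; min = 0x80; }
            else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; min = 0x800; }
            else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; min = 0x10000; }
            else return -1; // stray continuation byte, or a lead byte that is never valid

            if (utf8.length - in < need) return -1; // incomplete or truncated sequence
            for (int i = 0; i < need; i++) {
                int c = utf8[in++] & 0xFF;
                if ((c & 0xC0) != 0x80) return -1; // not a continuation byte
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) return -1; // "overlong" encoding
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1; // unmappable

            if (cp < 0x10000) {
                if (out >= utf16.length) return -1; // output too small
                utf16[out++] = (char) cp;
            } else {
                if (out + 2 > utf16.length) return -1; // no room for a surrogate pair
                utf16[out++] = Character.highSurrogate(cp);
                utf16[out++] = Character.lowSurrogate(cp);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00E9llo \uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        char[] utf16 = new char[utf8.length]; // sized as the documentation recommends
        int n = transcode(utf8, utf16);
        if (n < 0) {
            // Contents of utf16 are unspecified after an error; do not use them.
            System.out.println("invalid UTF-8 or output too small");
        } else {
            System.out.println(n + " code units: " + new String(utf16, 0, n));
        }
        // An overlong encoding of U+0000 is rejected.
        System.out.println(transcode(new byte[] { (byte) 0xC0, (byte) 0x80 }, new char[2])); // prints -1
    }
}
```

        Callers should always check for -1 before reading the output, and should use only the first n elements of utf16 on success, since the array is normally larger than the decoded text.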