• A Unicode character can be up to 4 bytes, so 2^32 or 4,294,967,296 potential unique characters. And it’d be easy enough to adjust the standard to allow for an extra byte(s) if necessary – it’s been done before.

    • Turun@feddit.de
      1 year ago

      This is incorrect. While a character (actually a code point) takes 4 bytes in UTF-32 and up to 4 bytes in UTF-8, the Unicode standard is limited to 17*2^16 = 1,114,112 code points. (edit: apparently because that is the limit of UTF-16. 4-byte UTF-8 can encode 2^21 code points, and the original UTF-8 design was not limited to four bytes: with sequences of up to six bytes it is able to encode 2^31 code points.)

      Unicode is the standard that says “the thing we call capital A is code point 65 (U+0041)”, literally defining a mapping from numbers to concepts.
      UTF-8 and UTF-32 are ways to encode a list of those numbers as bytes, in a more (UTF-8) or less (UTF-32) space-efficient way.
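
To check the figures in the reply above, here is a rough Python sketch. It assumes the usual UTF-8 bit layout (the lead byte of an n-byte sequence carries 7 - n payload bits, each continuation byte carries 6). The names `utf8_payload_bits` and `encoded_sizes` are made up for illustration and are not part of any library.

```python
# Size of the Unicode code space: 17 planes of 2**16 code points each.
UNICODE_CODE_POINTS = 17 * 2**16
MAX_CODE_POINT = UNICODE_CODE_POINTS - 1  # the highest code point, U+10FFFF


def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits available in an n-byte UTF-8-style sequence (hypothetical helper)."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte carries (7 - n) bits, each continuation byte (10xxxxxx) carries 6.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


def encoded_sizes(ch: str) -> dict:
    """Code point of a single character and its size in bytes per encoding."""
    return {
        "code_point": ord(ch),
        "utf8": len(ch.encode("utf-8")),
        "utf16": len(ch.encode("utf-16-le")),  # -le avoids counting a byte order mark
        "utf32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    print(UNICODE_CODE_POINTS, hex(MAX_CODE_POINT))
    print(utf8_payload_bits(4), utf8_payload_bits(6))
    for ch in ["A", "€", "😀"]:
        print(ch, encoded_sizes(ch))
```

The same code point always has the same number (`ord("A")` is 65), while its size in bytes depends on which encoding is chosen. Python itself refuses anything past the Unicode limit: `chr(0x110000)` raises `ValueError`.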