Unicode Gatekeepers S.Gilles The

Found at: sdf.org:70/users/sgilles/phlog/2017-06-01_unicode_gatekeepers.txt

Unicode Gatekeepers

S. Gilles


The problem

  There has been a lot[0][1] of thought put into security implications
  of unicode. A popular concern is that harmless string A might be
  confused with evil string B by a human observer, who might then
  be tricked into giving their login credentials to an attacker or
  running code signed by an attacker.

  That's not the part I'm interested in right now, however. There's
  another concern: a system which accepts unicode allows spammers
  a far wider playing field. It might not be a security risk for a
  phishing attempt to include a link to paypa🄻.com, but it's annoying
  to receive an email offering “DISCOUNT GENERIC 🄰 🅂 🄿 🄸 🅁 🄸 🄽 ”
  because your spam filter was only trained on ASCII. In the language
  of UTS #39, this is “fooling mechanical gatekeepers”.

  UTS #39 is designed to handle the first concern. It includes (or
  references, or describes, whatever) a file[2] named confusables.txt,
  which defines a equivalences classes of unicode glyphs as the
  consortium feels relevant. There are some other bits like multi-glyph
  decomposition, but those aren't critical.

  The problem is that there is no equivalent of confusables.txt for
  gatekeepers. But suppose we really want one. Suppose we're building
  software for a mailing list, or a BBS, or something, and we want
  to give administrators the ability to detect unicode-obfuscated
  spam by providing their own blacklist of keywords. How hard would
  that be?

Initial reaction: ”Don't do that”

  One obvious and completely foolproof method is to not support all
  of unicode. Messages must be in ASCII, or Latin 1, or some other
  small range.

  The problem with this is that people tend to expect unicode support
  these days. Even Slashcode was updated[3] to handle UTF-8 in 2014,
  after all. For the sake of argument, let's assume that our biggest
  customer has a habit of learning new languages and will be upset
  if the software has to be reconfigured for each new alphabet.

Best Attempt

  A second approach might be to map everything via confusables
  before applying our checking. We might perform something along
  the lines of (untested)

    typedef struct {
            Rune from;
            Rune *to;
    } replacement_t;

    replacement_t reps[] = { ... };
    #define NUM_REPLACEMENTS ((sizeof reps)/(sizeof reps[0]))

    int rcomp(void *a, void *b)
            Rune *k = a;
            replacement_t *rep = b;
            if (*k < rep->from) {
                    return -1;
            } else if (*k > rep->from) {
                    return 1;
            return 0;

    int check_spam(const Rune *message, const Rune *badword)
            Rune *normalized_message = 0;
            Rune *normalized_badword = 0;
            Rune *p = 0;
            replacement_t *v = { 0 };
            int ret = 0;

            for (p = message; *p; p++) {
                    if ((v = bsearch(p, reps, NUM_REPLACEMENTS,
                                     sizeof *reps, rcomp))) {
                            append_to_string(&normalized_message, v->to);

            for (p = badword; *p; p++) {
                    if ((v = bsearch(p, reps, NUM_REPLACEMENTS,
                                     sizeof *reps, rcomp))) {
                            append_to_string(&normalized_badword, v->to);

            if (rune_strstr(normalized_message, normalized_badword)) {
                    ret = -1;


            return ret;

  Aside from knowing the ‘...’ of reps, this isn't impossible, and
  is even readable. If the algorithm flows this way, we can use sed
  or awk to turn confusables.txt into the contents of ‘...’.


  A few problems show up immediately.

   o Extraneous characters. U+0224 LATIN CAPITAL LETTER Z WITH HOOK
     should probably be replaced with a simple Z, but it is listed
     as confusable with the sequence U+005A, U+0326 (a Z followed

   o Overzealous replacing. “m” is confusable with “rn”, so it's
     impossible to distinguish “yarn” and “yam”, in case we need
     to censor vegetables or something. This could be a problem.

   o Missing replacements. Codepoints like U+1F130 SQUARED LATIN
     CAPITAL LETTER A should probably map to good old U+0041 LATIN
     CAPITAL LETTER A. However, since the two glyps are visually
     quite distinct, they are not considered confusable.

  The last part is the most concerning. As newer releases of Unicode
  offer more characters, confusables.txt may be updated, but since
  there does not exist a “gatekeeper-confusables.txt”.


  The solution I'm using is the following (generally) algorithm for
  filling in the ‘...’ above. Depending on the real-world testing
  it gets, it may need tweaking.

    Parse confusables.txt into a map T: codepoint -> string.

    Parse UnicodeData.txt[4], and
    For every codepoint C
      If the general category is M, C, P, or Sk
        Make T delete that codepoint, by T(C) := ''

      T(U+0020) := ' ' (it was mistakenly deleted above)

      If there's no mapping
        If UnicodeData.txt indicates a decomposition D1, D2, ...
          T(C) = D1, D2, ...

    (Now T deletes the ‘extra’ characters.)

    Delete mappings for '0', '1', 'I', 'm', 'w', '|'.

    Add mappings
      T(U+0460) := 'w'
      T(U+FF29) := 'I'
      T(U+1F170) := 'A' (and so on through 'Z')
      T(U+00A9) := 'C'
      T(U+00AE) := 'R'
      T(U+00DF) := 'B'
      T(U+01AB) := 't'
      T(U+0272) := 'n'
      T(U+0274) := 'N'
      T(U+0291) := 'z'
      T(U+0298) := 'O'
      T(U+029F) := 'L'
      T(U+02B3) := 'r'
      T(U+0629) := 'o'
      T(U+0644) := 'J'
      T(U+1472) := 'b'
      T(U+1473) := 'b'
      T(U+1D07) := 'E'
      T(U+2117) := 'P'
      T(U+2365) := 'O'
      T(U+A793) := 'e'
      T(U+2776) := '(1)' (and so on through '(10)')
      T(U+2780) := '(1)' (and so on through '(10)')
      T(U+278A) := '(1)' (and so on through '(10)')

   Until T is fixed
     T := T ∘ T

   Output T in the form needed for rune transformation.

  It would be nice for the Unicode Consortium to address this more

[0] http://www.unicode.org/reports/tr36
[1] http://www.unicode.org/reports/tr39
[2] http://www.unicode.org/Public/security/latest/confusables.txt
[3] https://raw.githubusercontent.com/SoylentNews/rehash/fe00cc16d95c4b8bef921f9757a6af29712c7501/README.utf8
[4] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt