Unicode Gatekeepers
S. Gilles
2017-06-01
The problem
There has been a lot[0][1] of thought put into the security implications
of unicode. A popular concern is that harmless string A might be
confused with evil string B by a human observer, who might then
be tricked into giving their login credentials to an attacker or
running code signed by an attacker.
That's not the part I'm interested in right now, however. There's
another concern: a system which accepts unicode allows spammers
a far wider playing field. It might not be a security risk for a
phishing attempt to include a link to paypa🄻.com, but it's annoying
to receive an email offering “DISCOUNT GENERIC 🄰 🅂 🄿 🄸 🅁 🄸 🄽 ”
because your spam filter was only trained on ASCII. In the language
of UTS #39, this is “fooling mechanical gatekeepers”.
UTS #39 is designed to handle the first concern. It includes (or
references, or describes, whatever) a file[2] named confusables.txt,
which defines the equivalence classes of unicode glyphs that the
consortium feels are relevant. There are some other bits like
multi-glyph decomposition, but those aren't critical.
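For reference, data lines in confusables.txt pair a source codepoint
with its replacement sequence, plus a type field and a comment.
Entries look roughly like this (the comment text varies between
versions of the file):

    0441 ;	0063 ;	MA	# ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C
    006D ;	0072 006E ;	MA	# ( m → rn ) LATIN SMALL LETTER M → LATIN SMALL LETTER R, LATIN SMALL LETTER N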
The problem is that there is no equivalent of confusables.txt for
gatekeepers. But suppose we really want one. Suppose we're building
software for a mailing list, or a BBS, or something, and we want
to give administrators the ability to detect unicode-obfuscated
spam by providing their own blacklist of keywords. How hard would
that be?
Initial reaction: “Don't do that”
One obvious and completely foolproof method is to not support all
of unicode. Messages must be in ASCII, or Latin 1, or some other
small range.
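Part of the appeal is that the entire gatekeeper is a few obvious
lines. A sketch, using the same Rune type as the code further down:

	/* Reject any message containing a rune above ASCII; raise the
	 * cutoff to 0xFF for Latin 1. Returns -1 to reject, 0 to accept. */
	int check_range(const Rune *message)
	{
		const Rune *p;

		for (p = message; *p; p++) {
			if (*p > 0x7F) {
				return -1;
			}
		}
		return 0;
	}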
The problem with this is that people tend to expect unicode support
these days. Even Slashcode was updated[3] to handle UTF-8 in 2014,
after all. For the sake of argument, let's assume that our biggest
customer has a habit of learning new languages and will be upset
if the software has to be reconfigured for each new alphabet.
Best attempt
A second approach might be to map everything via confusables.txt
before applying our checks. We might perform something along
the lines of the following (untested):
#include <stdlib.h>

/* Rune, append_to_string(), and rune_strstr() are assumed to be
 * supplied by the surrounding codebase. */
typedef struct {
	Rune from;
	Rune *to;
} replacement_t;

/* Must be sorted by `from' so bsearch() can be used. */
replacement_t reps[] = { ... };
#define NUM_REPLACEMENTS ((sizeof reps)/(sizeof reps[0]))

int rcomp(const void *a, const void *b)
{
	const Rune *k = a;
	const replacement_t *rep = b;

	if (*k < rep->from) {
		return -1;
	} else if (*k > rep->from) {
		return 1;
	}
	return 0;
}

/* Returns -1 if the normalized message contains the normalized
 * bad word, 0 otherwise. */
int check_spam(const Rune *message, const Rune *badword)
{
	Rune *normalized_message = 0;
	Rune *normalized_badword = 0;
	Rune self[2] = { 0, 0 };
	const Rune *p = 0;
	replacement_t *v = 0;
	int ret = 0;

	for (p = message; *p; p++) {
		if ((v = bsearch(p, reps, NUM_REPLACEMENTS,
				sizeof *reps, rcomp))) {
			append_to_string(&normalized_message, v->to);
		} else {
			/* Runes with no entry map to themselves. */
			self[0] = *p;
			append_to_string(&normalized_message, self);
		}
	}
	for (p = badword; *p; p++) {
		if ((v = bsearch(p, reps, NUM_REPLACEMENTS,
				sizeof *reps, rcomp))) {
			append_to_string(&normalized_badword, v->to);
		} else {
			self[0] = *p;
			append_to_string(&normalized_badword, self);
		}
	}
	if (rune_strstr(normalized_message, normalized_badword)) {
		ret = -1;
	}
	free(normalized_message);
	free(normalized_badword);
	return ret;
}
Aside from knowing the ‘...’ of reps, this isn't impossible, and
is even readable. And since the check is table-driven, we can use
sed or awk to turn confusables.txt into the contents of ‘...’.
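For instance, something like this (untested, and assuming the field
layout shown earlier: source codepoint, semicolon, replacement
sequence, semicolon, type) would emit one C99 initializer per data
line. The one wrinkle is that bsearch() needs reps[] sorted
numerically on from, and since the file's codepoints are 4 or 5 hex
digits wide, a plain textual sort of the output isn't quite enough:

	awk -F';' '/^[0-9A-F]/ {
		gsub(/[ \t]/, "", $1)		# source codepoint
		n = split($2, t, " ")		# replacement sequence
		printf "{ 0x%s, (Rune[]){ ", $1
		for (i = 1; i <= n; i++)
			printf "0x%s, ", t[i]
		print "0 } },"
	}' confusables.txt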
Issues
A few problems show up immediately.
o Extraneous characters. U+0224 LATIN CAPITAL LETTER Z WITH HOOK
should probably be replaced with a simple Z, but it is listed
as confusable with the sequence U+005A, U+0326 (a Z followed
by COMBINING COMMA BELOW).
o Overzealous replacing. “m” is confusable with “rn”, so it's
impossible to distinguish “yarn” and “yam”, in case we need
to censor vegetables or something. This could be a problem.
o Missing replacements. Codepoints like U+1F130 SQUARED LATIN
CAPITAL LETTER A should probably map to good old U+0041 LATIN
CAPITAL LETTER A. However, since the two glyphs are visually
quite distinct, they are not considered confusable.
The last part is the most concerning. As newer releases of Unicode
offer more characters, confusables.txt may be updated, but since
there is no “gatekeeper-confusables.txt”, these missing replacements
will have to be found and patched by hand.
Finally
The solution I'm using is the following (general) algorithm for
filling in the ‘...’ above. Depending on the real-world testing
it gets, it may need tweaking.
Parse confusables.txt into a map T: codepoint -> string.
Parse UnicodeData.txt[4], and
For every codepoint C
    If the general category is M, C, P, or Sk
        Make T delete that codepoint, by T(C) := ''
    End
    T(U+0020) := ' ' (it was mistakenly deleted above)
    If there's no mapping
        If UnicodeData.txt indicates a decomposition D1, D2, ...
            T(C) := D1, D2, ...
        End
    End
End
(Now T deletes the ‘extra’ characters.)
Delete mappings for '0', '1', 'I', 'm', 'w', '|'.
Add mappings
    T(U+0460) := 'w'
    T(U+FF29) := 'I'
    T(U+1F170) := 'A' (and so on through 'Z')
    T(U+00A9) := 'C'
    T(U+00AE) := 'R'
    T(U+00DF) := 'B'
    T(U+01AB) := 't'
    T(U+0272) := 'n'
    T(U+0274) := 'N'
    T(U+0291) := 'z'
    T(U+0298) := 'O'
    T(U+029F) := 'L'
    T(U+02B3) := 'r'
    T(U+0629) := 'o'
    T(U+0644) := 'J'
    T(U+1472) := 'b'
    T(U+1473) := 'b'
    T(U+1D07) := 'E'
    T(U+2117) := 'P'
    T(U+2365) := 'O'
    T(U+A793) := 'e'
    T(U+2776) := '(1)' (and so on through '(10)')
    T(U+2780) := '(1)' (and so on through '(10)')
    T(U+278A) := '(1)' (and so on through '(10)')
Until T is fixed
    T := T ∘ T
Output T in the form needed for rune transformation.
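That last loop is the subtle part: one pass over the data is not
enough, because a replacement string can itself contain runes that
T maps elsewhere (a decomposition whose pieces are confusable,
say), so T has to be composed with itself until it stops changing.
A sketch of that step, reusing reps, rcomp, and append_to_string
from the C code above, and assuming no mapping ever leads back to
its own source rune (a cycle would loop forever):

	/* Look up one rune in reps; NULL means it maps to itself. */
	static Rune *map_rune(Rune r)
	{
		replacement_t *v = bsearch(&r, reps, NUM_REPLACEMENTS,
			sizeof *reps, rcomp);

		return v ? v->to : NULL;
	}

	/* One round of T := T ∘ T: push every replacement string back
	 * through T. Returns nonzero if anything changed. */
	static int compose_once(void)
	{
		static Rune empty[1] = { 0 };
		Rune self[2] = { 0, 0 };
		Rune *out, *t, *p;
		int changed = 0;
		size_t i;

		for (i = 0; i < NUM_REPLACEMENTS; i++) {
			out = NULL;
			for (p = reps[i].to; *p; p++) {
				if ((t = map_rune(*p))) {
					changed = 1;
					append_to_string(&out, t);
				} else {
					self[0] = *p;
					append_to_string(&out, self);
				}
			}
			/* NULL means everything was deleted: T(C) = ''.
			 * (This sketch leaks the old replacement string.) */
			reps[i].to = out ? out : empty;
		}
		return changed;
	}

	void compose_to_fixpoint(void)
	{
		while (compose_once())
			;
	}

Doing this once, at table-generation time, keeps check_spam itself
a single pass over the message.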
It would be nice for the Unicode Consortium to address this more
concretely.
[0] http://www.unicode.org/reports/tr36
[1] http://www.unicode.org/reports/tr39
[2] http://www.unicode.org/Public/security/latest/confusables.txt
[3] https://raw.githubusercontent.com/SoylentNews/rehash/fe00cc16d95c4b8bef921f9757a6af29712c7501/README.utf8
[4] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt