How to correctly normalize strings and how to compare them (in .NET)

As a developer, you sometimes have to correctly normalize strings, be it for quick case-insensitive lookups or for comparisons. The question is: what counts as a correct normalization for these use cases? This post uses C# for the sample code, but the topic applies to other languages and environments as well.

The easy (but wrong) way

The most common approach is to simply call .ToLower() on the string and do case-insensitive comparisons on the result. That, however, is wrong on a few levels. The first problem is that .ToLower() uses the current culture of the current thread to do the conversion. Depending on that culture, the results can be different and very surprising.
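To make that culture dependence concrete, here is a minimal sketch. It passes cultures explicitly instead of changing the thread's culture, and it assumes the runtime ships culture data for tr-TR (it will not behave this way in invariant globalization mode). The word "FILE" is just an illustrative input.

```csharp
using System;
using System.Globalization;

// Assumption: culture data for tr-TR is available (ICU or NLS).
var turkish = CultureInfo.GetCultureInfo("tr-TR");
var invariant = CultureInfo.InvariantCulture;

string input = "FILE";

// The explicit cultures stand in for "whatever the current thread happens to use".
string lowerInvariant = input.ToLower(invariant); // "file"
string lowerTurkish = input.ToLower(turkish);     // "fıle", with a dotless ı (U+0131)

Console.WriteLine(lowerInvariant == "file"); // True
Console.WriteLine(lowerTurkish == "file");   // False under Turkish casing rules
```

The same call on the same input gives two different strings, so any comparison built on a plain .ToLower() silently depends on where the code happens to run.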

Let's take this example: You would expect "Konny".StartsWith("Kon") to evaluate to true. If, however, for some reason the thread this code runs on is set to use the Hungarian culture (hu and more specifically hu-HU), the result is false. The reason is that in Hungarian, ny is a digraph that counts as a single letter, and nny is its doubled form. A culture-aware comparison therefore does not see "Konny" as the letters K, o, n followed by something else.

You can try that out for yourself by running this simple piece of code:

using System;
using System.Globalization;

// Check the prefix under every culture the runtime knows about.
foreach (var culture in CultureInfo.GetCultures(CultureTypes.AllCultures))
{
   if (!"Konny".StartsWith("kon", true, culture))
   {
      // You really don't expect the code to end up here, do you?
      Console.WriteLine($"'Konny' doesn't start with 'kon' in {culture}");
   }
}

And, to be very clear about this: if you don't happen to know about this behavior, you can end up debugging an issue like that for hours and hours, if not days.

The better, but still wrong way

So, the next logical step is to use .ToLowerInvariant() when you need normalized strings, simply to rule out the cultural differences between these transformations, right? That is what Microsoft did back in the day with their Membership system. There they stored user names and emails in a lowercase-normalized form for quick searches.

Well, it turns out even Microsoft makes mistakes sometimes. The issue with lowercase normalization is that, again, there are cultures and characters for which it loses information or produces an incorrect representation of the original data. The most prominent example is the Turkish letter İ, an I with a dot above. Under Unicode's simple case mapping, its lowercase form is the normal i we know. The thing is that our I also gets converted to i when you normalize to lowercase. So, after such a normalization you can't tell whether the original character was an İ or an I. Besides that, Turkish also has a dotless i, written ı, which is the correct lowercase version of the Turkish I.

This is not only a problem with Turkish text. There are other characters that don't survive a round trip through lowercase. One of them is the Greek rho symbol ϱ (U+03F1): its uppercase form is the Greek capital letter rho Ρ (U+03A1), but Ρ lowercases to the Greek small letter rho ρ (U+03C1). Lowercase normalization therefore keeps ϱ and ρ apart even though both are lowercase forms of the same Ρ, while uppercase normalization maps all three to Ρ.
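The rho case can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and the results assume the runtime applies the standard Unicode simple case mappings for these three code points.

```csharp
using System;

string rhoSymbol = "\u03F1";  // ϱ GREEK RHO SYMBOL
string smallRho = "\u03C1";   // ρ GREEK SMALL LETTER RHO
string capitalRho = "\u03A1"; // Ρ GREEK CAPITAL LETTER RHO

// Uppercasing maps both lowercase forms to the same capital letter.
bool equalAfterUpper = rhoSymbol.ToUpperInvariant() == smallRho.ToUpperInvariant();

// Lowercasing leaves ϱ untouched, while Ρ becomes ρ, so the two stay different.
bool equalAfterLower = rhoSymbol.ToLowerInvariant() == capitalRho.ToLowerInvariant();

// Upper followed by lower does not give the original character back.
bool roundTrips = rhoSymbol.ToUpperInvariant().ToLowerInvariant() == rhoSymbol;

Console.WriteLine(equalAfterUpper); // True
Console.WriteLine(equalAfterLower); // False
Console.WriteLine(roundTrips);      // False
```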

How to correctly normalize strings

So, the best way, if you need to do normalization in the first place, is to use .ToUpperInvariant() instead. Microsoft also noticed these issues and put up a whole page about best practices for using strings.
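As a rough sketch of what such a normalized lookup could look like: NormalizeKey is a hypothetical helper, and the dictionary with its sample entry is invented purely for illustration. The key point is that the same normalization is applied when storing and when looking up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: the single place where keys get normalized.
static string NormalizeKey(string value) => value.ToUpperInvariant();

// Keys are already normalized, so a plain ordinal comparer is enough.
var userIdsByName = new Dictionary<string, int>(StringComparer.Ordinal)
{
    [NormalizeKey("Konny")] = 42,
};

// Normalize the search term the same way before the lookup.
bool found = userIdsByName.TryGetValue(NormalizeKey("KONNY"), out int userId);

Console.WriteLine($"{found}, {userId}"); // True, 42
```

Keeping the normalization in one helper makes it harder for a stray .ToLower() to sneak into one of the two code paths.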

Now that we have normalized strings, we can easily do fast lookups for names and the like. But names are a case where these linguistic details really matter. What if we want to compare strings of a non-linguistic nature, such as identifiers, file paths or configuration keys? Since there are other issues that could arise, Microsoft strongly suggests using StringComparison.OrdinalIgnoreCase for those, which is also faster.
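A short sketch of ordinal, case-insensitive comparisons without any manual normalization. The header name and the settings dictionary are made-up examples of non-linguistic strings.

```csharp
using System;
using System.Collections.Generic;

// Compare two identifiers without involving any culture.
bool sameHeader = string.Equals("Content-Type", "content-type", StringComparison.OrdinalIgnoreCase);

// Prefix check that no longer depends on the current culture.
bool hasPrefix = "Konny".StartsWith("kon", StringComparison.OrdinalIgnoreCase);

// For lookups, the matching comparer can be handed to the collection.
var settings = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["LogLevel"] = "Debug",
};
bool hasKey = settings.ContainsKey("LOGLEVEL");

Console.WriteLine(sameHeader); // True
Console.WriteLine(hasPrefix);  // True
Console.WriteLine(hasKey);     // True
```

Passing the comparison explicitly also documents the intent, which a bare .ToLower() or .ToUpper() call never does.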

Conclusion

There are a lot of nasty surprises hidden in strings, case sensitivity and cultures. I strongly recommend saving the link to the best practices for using strings page, if only to have it handy and reread parts of it from time to time.

Besides that, it is probably a good idea to activate the Roslyn code analyzers, especially the globalization warnings. This lets the compiler tell you about potential issues in your code that you could trip over some day. Your future self will be thankful 😉
