Removing Vietnamese diacritics in Java
This post uses many Unicode concepts. If you are not familiar with Unicode, please read Java in the Unicode of Madness before continuing.
Diacritic: a mark added to a base character to create a new character with a different pronunciation.
Accent: a type of diacritic that changes the tone of a character.
Vietnamese
The Vietnamese alphabet has 29 letters, 12 of which are vowels.
Roman character: a b c d e g h i k l m n o p q r s t u v x y
Character with diacritic: ă â đ ê ô ơ ư
Accent: ◌́ (acute), ◌̀ (grave), ◌̉ (hook), ◌̃ (tilde), ◌̣ (dot)
The acute and grave accent
Unlike the other three accents, the acute and grave accents each have two Unicode code points.
Accent mark | Code point | Name |
◌̀ | U+0300 | Combining Grave Accent |
◌̀ | U+0340 | Combining Grave Tone Mark |
Accent mark | Code point | Name |
◌́ | U+0301 | Combining Acute Accent |
◌́ | U+0341 | Combining Acute Tone Mark |
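To illustrate why the duplicate code points matter, here is a minimal sketch comparing a string built with the tone mark against one built with the accent. It relies on U+0340 and U+0341 canonically decomposing to U+0300 and U+0301, so NFD normalization makes both spellings equal. The class and method names are illustrative only.

```java
import java.text.Normalizer;

class ToneMarkDemo {
    // Convert a string to canonical decomposed form (NFD).
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String withToneMark = "a\u0341"; // a + COMBINING ACUTE TONE MARK (U+0341)
        String withAccent = "a\u0301";   // a + COMBINING ACUTE ACCENT (U+0301)

        // The raw strings differ because the code points differ.
        System.out.println(withToneMark.equals(withAccent));      // false
        // After NFD the tone mark has been replaced by the accent.
        System.out.println(nfd(withToneMark).equals(withAccent)); // true
    }
}
```

Code that compares or strips accents without normalizing first has to account for both code points of each accent.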
The composed and decomposed form
Vietnamese characters in strings can be in two forms: composed or decomposed. A composed form is a single code point, while a decomposed form is a base letter followed by one or more combining diacritics and accents. For example:
Character | Composed | Decomposed |
á | á or á | a◌́ or a◌́ |
ẳ | ẳ | a◌̆◌̉ |
đ | đ | đ |
In Vietnamese, a character and its composed form are almost always identical. The decomposed form, however, has two special cases. First, the acute and grave accents have two code points each. Second, the character đ cannot be decomposed into d with a stroke diacritic mark, so normalization leaves it unchanged.
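The two forms can be inspected directly by printing code points. The following sketch (the helper name `describe` is hypothetical) decomposes ẳ with `Normalizer` and shows that đ stays a single code point.

```java
import java.text.Normalizer;

class FormsDemo {
    // Render every code point of a string as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String composed = "\u1EB3"; // the single code point for the letter a with breve and hook above
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(describe(composed));   // U+1EB3
        System.out.println(describe(decomposed)); // U+0061 U+0306 U+0309

        // U+0111 (d with stroke) has no canonical decomposition.
        String dStroke = Normalizer.normalize("\u0111", Normalizer.Form.NFD);
        System.out.println(describe(dStroke));    // U+0111
    }
}
```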
Regular Expression
Regular expressions support Unicode through the \p syntax. To match diacritics and accents, regex uses the \p{M} or \p{Mark} property class; the Java examples in this post use the short form \p{M}.
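As a small illustration of what `\p{M}` matches, this sketch counts combining marks in a string. A precomposed letter contains none, while its decomposed spelling contains one match per mark. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkMatchDemo {
    // Matches one character from any Unicode "Mark" general category.
    private static final Pattern MARK = Pattern.compile("\\p{M}");

    // Count how many combining marks appear in the input.
    static int countMarks(String s) {
        Matcher m = MARK.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMarks("\u1EB3"));        // 0: precomposed letter, no separate marks
        System.out.println(countMarks("a\u0306\u0309")); // 2: breve and hook above
    }
}
```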
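In Java, the Pattern and Matcher classes are used to work with regular expressions. The String class also supports regular expressions through the replaceAll() method, but this simply uses the Pattern and Matcher classes internally. One thing to note is that a compiled Pattern is immutable and safe to share between threads, whereas a Matcher is not thread-safe. Because String.replaceAll() compiles its pattern on every call, code that applies the same expression repeatedly performs better with a Pattern that is compiled once and reused.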
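The relationship between `String.replaceAll()` and `Pattern` can be sketched as follows. Both methods produce the same result; the second one compiles the expression once and creates a fresh `Matcher` per call, which is the usual way to share a pattern safely. The names `viaString` and `viaPattern` are illustrative.

```java
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once; a Pattern can be shared freely between threads.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Compiles the regular expression again on every call.
    static String viaString(String s) {
        return s.replaceAll("\\p{M}+", "");
    }

    // Reuses the compiled Pattern; each call gets its own Matcher.
    static String viaPattern(String s) {
        return MARKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301o\u0302"; // a + acute, o + circumflex
        System.out.println(viaString(decomposed));  // ao
        System.out.println(viaPattern(decomposed)); // ao
    }
}
```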
Removing Vietnamese diacritics
The task is to remove the diacritics from Vietnamese characters and return the base Latin letters.
Normalization
The most general solution is to decompose Vietnamese characters into their base letters and combining diacritics, then use a regular expression to remove the diacritics.
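import java.text.Normalizer;

String text = "Tiếng Việt có đấu";
// After NFD: T i e ◌̂ ◌́ n g   V i e ◌̣ ◌̂ t   c o ◌́   đ a ◌̂ ◌́ u
String decompositedForm = Normalizer.normalize(text, Normalizer.Form.NFD);
// Strings are immutable, so the result of replaceAll() must be captured.
String stripped = decompositedForm.replaceAll("\\p{M}", ""); // Tieng Viet co đau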
This solution has several advantages. First, it works for Vietnamese and many other languages. Second, it handles both composed and decomposed input. Finally, Java already ships the Normalizer class to decompose characters.
However, this solution is comparatively slow because every string goes through both the normalization algorithm and a regular-expression pass. Additionally, it does not work for the đ character (or its uppercase form Đ), so it requires an additional replacement step to convert it to d.
String result = decompositedForm.replaceAll("\\p{M}", "")
    .replace('đ', 'd')
    .replace('Đ', 'D'); // Tieng Viet co dau
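Putting the steps together, a reusable helper might look like the sketch below. The class and method names are hypothetical; the pattern is compiled once so repeated calls do not pay the compilation cost again.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticRemover {
    // One or more combining marks, compiled once and reused.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String removeDiacritics(String text) {
        // 1. Decompose into base letters followed by combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // 2. Drop the marks, then handle the letter that has no decomposition.
        return MARKS.matcher(decomposed).replaceAll("")
                .replace('đ', 'd')
                .replace('Đ', 'D');
    }

    public static void main(String[] args) {
        System.out.println(removeDiacritics("Tiếng Việt có đấu")); // Tieng Viet co dau
    }
}
```

Because the input is normalized first, the helper accepts composed text, decomposed text, and text that uses the tone-mark code points U+0340 and U+0341.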
Replacing
This solution is more familiar than Normalization and less complex, as it only deals with Vietnamese. The idea is simple: use the replaceAll() method to replace each group of Vietnamese characters that share the same base letter with that letter.
String text = "Tiếng Việt có đấu";
text = text.replaceAll("[ÁÀẢÃẠÂẤẦẨẪẬĂẮẰẲẴẶ]", "A")
    .replaceAll("[àáạảãâầấậẩẫăằắặẳẵ]", "a")
    .replaceAll("[ÉÈẺẼẸÊẾỀỂỄỆ]", "E")
    .replaceAll("[èéẹẻẽêềếệểễ]", "e")
    .replaceAll("[ÍÌỈĨỊ]", "I")
    .replaceAll("[ìíịỉĩ]", "i")
    .replaceAll("[ÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢ]", "O")
    .replaceAll("[òóọỏõôồốộổỗơờớợởỡ]", "o")
    .replaceAll("[ÚÙỦŨỤƯỨỪỬỮỰ]", "U")
    .replaceAll("[ùúụủũưừứựửữ]", "u")
    .replaceAll("[ÝỲỶỸỴ]", "Y")
    .replaceAll("[ỳýỵỷỹ]", "y")
    .replaceAll("Đ", "D")
    .replaceAll("đ", "d")
    // Remove combining marks left over from decomposed input: grave, acute,
    // circumflex (U+0302), tilde, breve, hook above, horn, dot below, and
    // the two tone marks U+0340 and U+0341.
    .replaceAll("[\u0300\u0301\u0302\u0303\u0306\u0309\u031B\u0323\u0340\u0341]", "");
Although this solution may seem straightforward, it is not efficient: every replaceAll() call compiles a regular expression and scans the whole string again, so a single input is processed once per rule.
Mapping
This is the most verbose approach: map each Vietnamese character directly to its base letter.
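import java.util.HashMap;
import java.util.Map;

String text = "Tiếng Việt có đấu";
// Double-brace initialization with the diamond operator requires Java 9 or later.
Map<Character, Character> map = new HashMap<>() {{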
put('á', 'a');
put('à', 'a');
put('ả', 'a');
put('ã', 'a');
put('ạ', 'a');
put('ắ', 'a');
put('ằ', 'a');
put('ẳ', 'a');
put('ẵ', 'a');
put('ặ', 'a');
put('ấ', 'a');
put('ầ', 'a');
put('ẩ', 'a');
put('ẫ', 'a');
put('ậ', 'a');
put('é', 'e');
put('è', 'e');
put('ẻ', 'e');
put('ẽ', 'e');
put('ẹ', 'e');
put('ế', 'e');
put('ề', 'e');
put('ể', 'e');
put('ễ', 'e');
put('ệ', 'e');
put('í', 'i');
put('ì', 'i');
put('ỉ', 'i');
put('ĩ', 'i');
put('ị', 'i');
put('ó', 'o');
put('ò', 'o');
put('ỏ', 'o');
put('õ', 'o');
put('ọ', 'o');
put('ố', 'o');
put('ồ', 'o');
put('ổ', 'o');
put('ỗ', 'o');
put('ộ', 'o');
put('ớ', 'o');
put('ờ', 'o');
put('ở', 'o');
put('ỡ', 'o');
put('ợ', 'o');
put('ú', 'u');
put('ù', 'u');
put('ủ', 'u');
put('ũ', 'u');
put('ụ', 'u');
put('ứ', 'u');
put('ừ', 'u');
put('ử', 'u');
put('ữ', 'u');
put('ự', 'u');
put('ý', 'y');
put('ỳ', 'y');
put('ỷ', 'y');
put('ỹ', 'y');
put('ỵ', 'y');
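// Letters that carry a diacritic but no tone accent.
put('ă', 'a');
put('â', 'a');
put('ê', 'e');
put('ô', 'o');
put('ơ', 'o');
put('ư', 'u');
put('đ', 'd');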
put('Á', 'A');
put('À', 'A');
put('Ả', 'A');
put('Ã', 'A');
put('Ạ', 'A');
put('Ắ', 'A');
put('Ằ', 'A');
put('Ẳ', 'A');
put('Ẵ', 'A');
put('Ặ', 'A');
put('Ấ', 'A');
put('Ầ', 'A');
put('Ẩ', 'A');
put('Ẫ', 'A');
put('Ậ', 'A');
put('É', 'E');
put('È', 'E');
put('Ẻ', 'E');
put('Ẽ', 'E');
put('Ẹ', 'E');
put('Ế', 'E');
put('Ề', 'E');
put('Ể', 'E');
put('Ễ', 'E');
put('Ệ', 'E');
put('Í', 'I');
put('Ì', 'I');
put('Ỉ', 'I');
put('Ĩ', 'I');
put('Ị', 'I');
put('Ó', 'O');
put('Ò', 'O');
put('Ỏ', 'O');
put('Õ', 'O');
put('Ọ', 'O');
put('Ố', 'O');
put('Ồ', 'O');
put('Ổ', 'O');
put('Ỗ', 'O');
put('Ộ', 'O');
put('Ớ', 'O');
put('Ờ', 'O');
put('Ở', 'O');
put('Ỡ', 'O');
put('Ợ', 'O');
put('Ú', 'U');
put('Ù', 'U');
put('Ủ', 'U');
put('Ũ', 'U');
put('Ụ', 'U');
put('Ứ', 'U');
put('Ừ', 'U');
put('Ử', 'U');
put('Ữ', 'U');
put('Ự', 'U');
put('Ý', 'Y');
put('Ỳ', 'Y');
put('Ỷ', 'Y');
put('Ỹ', 'Y');
put('Ỵ', 'Y');
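// Uppercase letters that carry a diacritic but no tone accent.
put('Ă', 'A');
put('Â', 'A');
put('Ê', 'E');
put('Ô', 'O');
put('Ơ', 'O');
put('Ư', 'U');
put('Đ', 'D');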
}};
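StringBuilder sb = new StringBuilder(text);
for (int i = 0; i < sb.length(); i++) {
    char c = sb.charAt(i);
    // Plain ASCII letters and spaces are kept as they are.
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ' ') {
        continue;
    }
    Character alphabet = map.get(c);
    if (alphabet == null) {
        // Unmapped characters (combining marks, but also digits and
        // punctuation) are dropped.
        sb.deleteCharAt(i);
        i--; // the next character has shifted into index i, so revisit it
    } else {
        sb.setCharAt(i, alphabet);
    }
}
String result = sb.toString(); // Tieng Viet co dau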
Although it is verbose, it is highly efficient compared to the other two solutions. It visits each character once, looks it up in a map, and never calls the replaceAll() method.
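The verbosity can be reduced by generating the map instead of writing every entry by hand. The following is a sketch of one possible variant, not the benchmarked code: each group string starts with a base letter followed by its variants, and uppercase entries are derived with `Character.toUpperCase`. Unlike the loop above, it keeps unmapped characters such as digits and only skips non-spacing combining marks. All names here are hypothetical.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class VietnameseMapper {
    // Base letter first, then every precomposed variant that maps to it.
    private static final String[] GROUPS = {
        "aáàảãạăắằẳẵặâấầẩẫậ",
        "eéèẻẽẹêếềểễệ",
        "iíìỉĩị",
        "oóòỏõọôốồổỗộơớờởỡợ",
        "uúùủũụưứừửữự",
        "yýỳỷỹỵ",
        "dđ"
    };

    private static final Map<Character, Character> MAP = new HashMap<>();

    static {
        for (String group : GROUPS) {
            // Normalize to NFC so each variant is one char even if this
            // source file was saved in decomposed form.
            String g = Normalizer.normalize(group, Normalizer.Form.NFC);
            char base = g.charAt(0);
            for (int i = 1; i < g.length(); i++) {
                char lower = g.charAt(i);
                MAP.put(lower, base);
                MAP.put(Character.toUpperCase(lower), Character.toUpperCase(base));
            }
        }
    }

    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Skip combining marks that come from decomposed input.
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue;
            }
            Character base = MAP.get(c);
            sb.append(base != null ? base : c);
        }
        return sb.toString();
    }
}
```

Calling `VietnameseMapper.strip("Tiếng Việt có đấu")` returns `Tieng Viet co dau`. The lookup is still a single pass over the input, so it should keep the main performance property of the mapping approach, although it has not been measured here.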
Benchmark
Tool: JMH 1.36
OS: Ubuntu 22.04
CPU: I5-1135G7
Benchmark | Score |
Mapping | 27966412.833 ± 145145.204 ops/s |
Normalization | 1486144.004 ± 17256.300 ops/s |
Replacing | 153442.074 ± 1862.282 ops/s |
Written by
Nguyen Hoang Nam
Back-end Developer, Writer and Book Enthusiast