
Writing a Japanese conjugation parser

I wrote this a while ago, so some of the information may be dated. I still think it is pretty interesting, and I still use and like DIRE.

Recently I have been writing my own Japanese dictionary software. One of the key features I want in a dictionary app is to be able to look up a word even when it is conjugated. There are some libraries to do this, but DIRE rolls its own, and it is unexpectedly simple, so I thought I would talk about it here.

Japanese conjugation has a few features which make it nice to work with. The first is that conjugations are mostly standardized and almost always come at the end of the word. There are exceptions1, but they are rare enough that I handle them separately, so to de-conjugate a word I just have to pattern match the end of it. The second is that some conjugations are stacked versions of other conjugations.

This means that if I have a word, say 食べたかった, I can see that たかった de-conjugates to たい, giving me 食べたい. Then I see that たい de-conjugates to る, giving me 食べる, which is the dictionary form4.

More formally this can be summarized as:

  1. Add the current word to the results
  2. Check if the word can be de-conjugated again
  3. Add de-conjugations to the to-do list and return to step 1

This returns a list of possible words, which I can then filter by checking which ones actually appear in the dictionary.

For my list of conjugations I decided to use the one that yomichan2 uses, available here.
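Each rule is essentially a pair: an inflected ending and what to replace it with. The handful of entries below are illustrative stand-ins I made up for this article, not yomichan's actual rule file, and these particular replacements are only correct for ichidan verbs:

# Illustrative (ending, replacement) pairs, not the real rule list.
# The real rules carry extra information to restrict which verb
# classes each replacement applies to.
conjugations = [
    ("たかった", "たい"),  # past of the -tai ("want to") form
    ("たい", "る"),        # -tai form back to the dictionary form
    ("なかった", "ない"),  # past negative back to plain negative
    ("ない", "る"),        # plain negative back to the dictionary form
]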

Here is a slightly modified version of the code that does the de-conjugation:

import re

# Return all possible de-conjugations; this does not check whether each
# result is actually a word in the dictionary.
def deconjugate_word(word, col, db):
    conjugations = sql_wrapper.get_conjugations(db)
    seen = [word]
    possibilities = [word]
    result = []
    num_conj = 0
    # This does exactly what it sounds like. col will be explained later
    lookup = search_word(word, db, col=col)
    while len(possibilities) > 0:
        cur_word = possibilities.pop()
        for inflection in conjugations:
            if cur_word.endswith(inflection[0]):
                # There is a more efficient way to do this, but I was lazy at the time
                new_word = re.sub(inflection[0] + '$',
                        inflection[1],
                        cur_word)
                num_conj += 1
                # Skip empty results so we never queue up an empty string
                if len(new_word) == 0:
                    continue
                if new_word not in seen:
                    possibilities.append(new_word)
                    seen.append(new_word)
                lookup = search_word(new_word, db, col=col)
                if len(lookup) > 0:
                    result.append(Entry_Set(lookup, len(word), num_conj))

    return result

Then comes the problem of sorting results. I take the following approach:

  1. Prefer results that came from a longer portion of the original lookup string.
  2. If two results came from the same length of text, prefer the one that needed fewer de-conjugations.

The first rule is relevant because often the start of a word can itself be a word. For example, both "食べる" and "食" are in the dictionary. So we should prioritize the result we got from looking at the full word and not just the first character.

The second rule is important because sometimes a word is valid and also de-conjugates to another valid word. This is particularly true when a dictionary includes separate entries for specific conjugations of words, or entries for expressions. For example, some dictionaries contain dedicated entries for "かねない" or "ありえない" despite these being negative conjugations of other words. To handle this we store the results from one lookup in an Entry_Set object and define how they should be sorted:

class Entry_Set():
    def __init__(self, entries, orig_len, num_deconjugations):
        self.orig_len = orig_len
        self.num_deconjugations = num_deconjugations
        # We also allow the user to sort dictionaries but that isn't super relevant for this article
        if config.dict_order is not None:
            def key(entry):
                return config.dict_order[entry.dictionary]
            entries = sorted(entries, key=key)
        self.entries = entries

    def __lt__(self, other):
        if self.orig_len == other.orig_len:
            return self.num_deconjugations < other.num_deconjugations
        return self.orig_len > other.orig_len

    def __gt__(self, other):
        if self.orig_len == other.orig_len:
            return self.num_deconjugations > other.num_deconjugations
        return self.orig_len < other.orig_len

    def __eq__(self, other):
        return self.entries[0] == other.entries[0]

    def __le__(self, other):
        if self.orig_len == other.orig_len:
            return self.num_deconjugations <= other.num_deconjugations
        return self.orig_len >= other.orig_len

    def __ge__(self, other):
        if self.orig_len == other.orig_len:
            return self.num_deconjugations >= other.num_deconjugations
        return self.orig_len <= other.orig_len

    def __ne__(self, other):
        return self.entries[0] != other.entries[0]
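With these comparison methods in place, a plain results.sort() orders everything the way the two rules describe. Here is a tiny illustration; the entries are placeholder strings and it assumes config.dict_order is None:

# Hypothetical Entry_Set objects from a lookup of 食べたかった.
# Arguments are (entries, orig_len, num_deconjugations).
results = [
    Entry_Set(["entry for 食"], 1, 0),      # only the first character matched
    Entry_Set(["entry for 食べる"], 6, 2),  # the full string matched after two de-conjugations
]
results.sort()  # __lt__ puts longer matches first, then fewer de-conjugations
# -> the 食べる entry now comes before the 食 entry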

Finally comes the issue of multiple writings. In Japanese most words can be written in at least three ways: a "standard" writing with kanji, a writing with only hiragana, or one with only katakana. So 食べる could be:

  1. 食べる
  2. たべる
  3. タベル

Converting from the first to the other two is somewhat hard3, but the second and third have a one-to-one mapping, which makes converting between them trivial.

def katakana_to_hiragana(word):
    # Map each katakana character to its hiragana equivalent; anything not
    # in the table (kanji, hiragana, punctuation) passes through unchanged.
    kana_dict = { 'ア': 'あ', 'カ': 'か', 'サ': 'さ', 'タ': 'た', 'ナ': 'な', 'ハ': 'は',
    'マ': 'ま', 'ヤ': 'や', 'ラ': 'ら', 'ワ': 'わ', 'イ': 'い', 'キ': 'き', 'シ': 'し',
    'チ': 'ち', 'ニ': 'に', 'ヒ': 'ひ', 'ミ': 'み', 'リ': 'り', 'ヰ': 'ゐ', 'ウ': 'う',
    'ク': 'く', 'ス': 'す', 'ツ': 'つ', 'ヌ': 'ぬ', 'フ': 'ふ', 'ム': 'む', 'ユ': 'ゆ',
    'ル': 'る', 'エ': 'え', 'ケ': 'け', 'セ': 'せ', 'テ': 'て', 'ネ': 'ね', 'ヘ': 'へ',
    'メ': 'め', 'レ': 'れ', 'ヱ': 'ゑ', 'オ': 'お', 'コ': 'こ', 'ソ': 'そ', 'ト': 'と',
    'ノ': 'の', 'ホ': 'ほ', 'モ': 'も', 'ヨ': 'よ', 'ロ': 'ろ', 'ヲ': 'を', 'ン': 'ん',
    'ャ': 'ゃ', 'ュ': 'ゅ', 'ョ': 'ょ', 'ガ': 'が', 'ギ': 'ぎ', 'グ': 'ぐ', 'ゲ': 'げ',
    'ゴ': 'ご', 'ザ': 'ざ', 'ジ': 'じ', 'ズ': 'ず', 'ゼ': 'ぜ', 'ゾ': 'ぞ', 'ダ': 'だ',
    'ヂ': 'ぢ', 'ヅ': 'づ', 'デ': 'で', 'ド': 'ど', 'バ': 'ば', 'ビ': 'び', 'ブ': 'ぶ',
    'ベ': 'べ', 'ボ': 'ぼ', 'パ': 'ぱ', 'ピ': 'ぴ', 'プ': 'ぷ', 'ペ': 'ぺ', 'ポ': 'ぽ',
    'ー': 'ー', "ッ": "っ"}
    result = ""
    for char in word:
        if char in kana_dict:
            result += kana_dict[char]
        else:
            result += char

    return result
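As an aside, the same conversion can be done without a lookup table, because the katakana block in Unicode sits at a fixed offset of 0x60 above the hiragana block. This is just a sketch of that alternative, not what DIRE actually uses:

# Alternative sketch: hiragana (ぁ..ゖ, U+3041-U+3096) and katakana
# (ァ..ヶ, U+30A1-U+30F6) are parallel Unicode blocks offset by 0x60.
def katakana_to_hiragana_offset(word):
    result = ""
    for char in word:
        code = ord(char)
        if 0x30A1 <= code <= 0x30F6:    # katakana ァ..ヶ
            result += chr(code - 0x60)  # shift down into the hiragana block
        else:
            result += char              # leave everything else untouched
    return result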

This is why there was that col argument to search_word earlier: when we look up a word, we look it up both as it is written and as it is pronounced. That way, if a word is written using katakana we can still find it.
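Roughly, the lookup just queries whichever column col names, normalizing katakana on the reading side, and the full lookup checks both columns. The sketch below is only an illustration: the table and column names (entries, writing, reading) are stand-ins I chose here, not DIRE's real schema, and the real search_word is more involved.

# Sketch only: assumes db is a sqlite3.Connection and that there is an
# "entries" table with "writing" and "reading" columns (made-up names).
def search_word_sketch(word, db, col):
    if col == "reading":
        # readings are stored in hiragana, so normalize katakana input
        word = katakana_to_hiragana(word)
    # col comes from our own code, not user input, so it is safe to inline
    cur = db.execute(f"SELECT * FROM entries WHERE {col} = ?", (word,))
    return cur.fetchall()

# Check both the written form and the pronunciation.
def lookup_both(word, db):
    return (search_word_sketch(word, db, "writing")
            + search_word_sketch(word, db, "reading"))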

That is pretty much how it all works. In the actual program there are a few more complications, but all of the essentials are here.


  1. For example, する conjugates in a non-standard way, so each of its conjugations is handled separately. 

  2. A lot of the design of this algorithm is the same as yomichan's. Yomichan is great software. 

  3. There are more advanced ways to do this; it is what libraries like MeCab do. I felt that would be a bit overkill for now, but I might add something similar in the future. 

  4. If you are particularly familiar with Japanese you might have noted that we could keep going back into the mizenkei (未然形). We don't go that far back and just stop at the dictionary form for simplicity.