2010-12-27

Google knows how to divide text into words

... in Chinese, that is. This is no small feat, because Chinese text, when written in the usual way (in Chinese characters), does not mark the division of text into words at all (with the exception of some very special cases, such as when transcribing foreign personal names into Chinese). When Chinese speakers need to write a sentence in Pinyin (Latin transcription), they often end up writing every syllable as a separate word or, more rarely, running all the words together. (The photo above shows both possibilities.) Most automatic Chinese-characters-to-Pinyin converters likewise put a space after the transcription of every character. Google Translate, however, appears to have a pretty good idea of where to put spaces between words in Pinyin.

Getting to the Pinyin, though, is a bit tricky. To do it, one can enter a Chinese phrase, ask Google Translate to "translate" it into another version of Chinese (e.g., simplified to traditional), and click on the "Read phonetically" link below, which gives the Pinyin transcription of the phrase.

E.g., for "有可能朱棣立神道碑加工期间,发现龟趺脖子下裂缝而弃之" ("Perhaps, during Zhu Di's installation of the Sacred Way Stele, cracks were discovered under the neck of the stone tortoise [serving as the pedestal] and it was abandoned") you get "Yǒu kěnéng zhūdì lì shéndàobēi jiāgōng qíjiān, fāxiàn guī fū bózi xià lièfèng ér qì zhī". I think that is pretty good for a machine, although of course the name Zhu Di should be capitalized and written with a space, and I would probably write "guīfū" ("tortoise-shaped pedestal") as a single word.
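To make the difference between per-character and word-aware transcription concrete, here is a minimal Python sketch. It is not how Google Translate works (I have no idea what they do internally); it uses forward maximum matching, a textbook dictionary-based approach, with a tiny hand-made character table and word list that are assumptions made up for this example and cover only a fragment of the sentence above.

```python
# Toy data, hand-made for this illustration (hypothetical; a real converter
# would need a large dictionary and would have to handle polyphonic characters).
CHAR_PINYIN = {
    "发": "fā", "现": "xiàn", "龟": "guī", "趺": "fū", "脖": "bó",
    "子": "zi", "下": "xià", "裂": "liè", "缝": "fèng",
}
WORDS = {"发现", "龟趺", "脖子", "裂缝"}


def segment(text, words=WORDS):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    max_len = max(len(w) for w in words)
    result = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in words:
                result.append(chunk)
                i += size
                break
    return result


def pinyin_per_character(text):
    """What most converters do: one space-separated syllable per character."""
    return " ".join(CHAR_PINYIN[ch] for ch in text)


def pinyin_by_word(text):
    """Join the syllables within each segmented word, put spaces between words."""
    return " ".join(
        "".join(CHAR_PINYIN[ch] for ch in word) for word in segment(text)
    )


phrase = "发现龟趺脖子下裂缝"
print(pinyin_per_character(phrase))  # fā xiàn guī fū bó zi xià liè fèng
print(pinyin_by_word(phrase))        # fāxiàn guīfū bózi xià lièfèng
```

Because the toy word list happens to include 龟趺, this sketch writes "guīfū" as one word; with a dictionary that lacks the word it would fall back to "guī fū", just as in Google's output.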
