R Tip of the Day

The tmcn package

Nhi Luong

2025-10-23

Let’s explore tmcn!

Short Introduction

The tmcn package is a text mining toolkit for the Chinese language.

We can use functions from the package to:

  • Convert Traditional Chinese to Simplified Chinese (and the reverse)
  • Convert Chinese text to pinyin
  • Output a dictionary of Chinese stop words
  • Look up information about a character, such as its pinyin, its radical, and the stroke count of its radical

Convert Chinese text to Pinyin: toPinyin()

Pinyin is the romanized spelling of Chinese words. A digit after a syllable marks its tone; a syllable with no digit is in the neutral tone. For example:

  • 妈妈 -> ma1ma
  • 爸爸 -> ba4ba

Let’s try it!

library(tmcn)

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")
toPinyin(chinese_words)
[1] "baba"        "mama"        "shengaolafu" "laoshi"      "xuesheng"   
[6] "diannao"     "daxue"       "qiuji"      

How about a table?

Chinese Words   Pinyin        English Translation
爸爸            baba          father
妈妈            mama          mother
圣奥拉夫        shengaolafu   Saint Olaf
老师            laoshi        teacher
学生            xuesheng      student
电脑            diannao       computer
大学            daxue         university
秋季            qiuji         fall (season)
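
A table like the one above could be assembled in R along these lines. This is only a sketch: the column names and the `english` vector are illustrative choices (the translations are typed by hand), and it assumes the tibble package is installed alongside tmcn.

```r
library(tmcn)
library(tibble)

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")

# Hand-written translations, in the same order as chinese_words (illustrative)
english <- c("father", "mother", "Saint Olaf", "teacher",
             "student", "computer", "university", "fall (season)")

# One row per word: original text, toPinyin() result, translation
word_table <- tibble(
  chinese = chinese_words,
  pinyin  = toPinyin(chinese_words),
  english = english
)

word_table
```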

Convert simplified to traditional: toTrad()

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")
toTrad(chinese_words)
[1] "爸爸"     "媽媽"     "聖奧拉夫" "老師"     "學生"     "電腦"     "大學"    
[8] "秋季"    

Simplified   Traditional   English Translation
妈妈         媽媽          mother
电脑         電腦          computer
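
The conversion can also go the other way. A minimal sketch, assuming the `rev = TRUE` argument of toTrad() (the same argument used in the word cloud code later in this deck) switches the direction to Traditional -> Simplified; the input vector here is made up for illustration.

```r
library(tmcn)

# Traditional Chinese input (illustrative examples)
traditional_words <- c("媽媽", "電腦", "學生")

# rev = TRUE reverses the direction: Traditional -> Simplified
simplified_words <- toTrad(traditional_words, rev = TRUE)
simplified_words
```

A round trip is not guaranteed to return the original text in general, because several traditional characters can share one simplified form.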

Explore GBK Dataset

  • The GBK dataset provides useful information about each character, such as its pinyin, its radical, and the stroke count of its radical.
library(dplyr)

GBK |>
  as_tibble() |>
  slice_head(n = 3)
# A tibble: 3 × 8
  GBK   py0   py        Radical Stroke_Num_Radical Stroke_Order Structure   Freq
  <chr> <chr> <chr>     <chr>                <int> <chr>        <chr>      <dbl>
1 吖    a     ā yā      口                       3 丨フ一丶ノ丨 左右          26
2 阿    a     ā ɑ ē     阝                       2 フ丨一丨フ一丨…… 左右      526031
3 啊    a     ɑ á à ǎ ā 口                       3 丨フ一フ丨一丨フ一丨…… 左中右     53936

chinese_words_split <- c("爸","爸","妈","妈","圣","奥","拉","夫","老","师","学","生","电","脑","大","学","秋","季")
chinese_words_split |>
  as_tibble() |>
  distinct(value) |>
  inner_join(GBK, join_by(value == GBK)) |>
  slice_head(n = 5) |>
  select(1:5)
# A tibble: 5 × 5
  value py0   py          Radical Stroke_Num_Radical
  <chr> <chr> <chr>       <chr>                <int>
1 爸    ba    bà          父                       4
2 妈    ma    mā          女                       3
3 圣    sheng shènɡ       土                       3
4 奥    ao    ào yù       大                       3
5 拉    la    lá là lǎ lā 扌                       3
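
The single-character vector above was typed by hand. As a sketch, the same split can be derived from the original words with base R's strsplit(); the names `chinese_chars` and `unique_chars` are illustrative, and the example assumes a UTF-8 session.

```r
chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")

# Splitting on the empty string breaks each word into single characters;
# unlist() flattens the per-word results into one character vector
chinese_chars <- unlist(strsplit(chinese_words, ""))

# Drop repeated characters before looking them up
unique_chars <- unique(chinese_chars)

length(chinese_chars)
length(unique_chars)
```

The resulting vector can then be passed through the same as_tibble() and inner_join() steps to look up each character in GBK.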

In action!

  • “woman” + “horse” = “mother”
  • “女” + “马” = “妈”

Journey to the West

  • One of the most famous Chinese novels is “Journey to the West”. Below is an excerpt from chapter 5 (第五回).
journey_west[[1]][1]
[1] "話表齊天大聖到底是個妖猴,更不知官銜品從,也不較俸祿高低,但只註名便 了。那齊天府下二司仙吏,早晚伏侍,只知日食三餐,夜眠一榻,無事牽縈, 自由自在。閑時節會友遊宮,交朋結義。見三清稱個「老」字,逢四帝道個 「陛下」。與那九曜星、五方將、二十八宿、四大天王、十二元辰、五方五老 、普天星相、河漢群神,俱只以弟兄相待,彼此稱呼。今日東遊,明日西蕩, 雲去雲來,行蹤不定。"
  • The STOPWORDS dataset lists Chinese stop words, written in Simplified Chinese:
head(STOPWORDS)
  word
1 第二
2 一番
3 一直
4 一个
5 一些
6 许多

Can we make a word cloud for chapter 5?

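library(dplyr)
library(tidytext)
library(wordcloud2)

# Tokenize the chapter text (the column is named after the chapter title)
journey_west_token <- journey_west |>
  unnest_tokens(word, "第五回 亂蟠桃大聖偷丹 反天宮諸神捉怪", token = "words")

# STOPWORDS is in Simplified Chinese, so convert the tokens before filtering
journey_dfs <- journey_west_token |>
  mutate(simplified = toTrad(word, rev = TRUE)) |>
  anti_join(STOPWORDS, by = join_by(simplified == word)) |>
  count(simplified, sort = TRUE) |>
  slice_head(n = 50) |>
  data.frame()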

wordcloud2(
  journey_dfs, 
  size = 1.2, 
  shape = 'cardioid',
  minSize = 15
)