R Tip of the Day

The tmcn package

Nhi Luong

2025-10-23

Let’s explore tmcn!

Short Introduction

The tmcn package is a text mining toolkit for the Chinese language.

We can use functions from the package to:

Convert between Traditional Chinese and Simplified Chinese (in either direction)
Convert Chinese text to pinyin
Output a dictionary of Chinese stop words
Look up information about a character, such as its pinyin, its radical, and the stroke count of the radical

Convert Chinese text to Pinyin: toPinyin()

Pinyin is the romanized spelling of Chinese words. A digit after a syllable marks its tone, and a neutral-tone syllable carries no digit. For example:

妈妈 –> ma1ma

爸爸 –> ba4ba

Let’s try it!

library(tmcn)

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")
toPinyin(chinese_words)

[1] "baba"        "mama"        "shengaolafu" "laoshi"      "xuesheng"   
[6] "diannao"     "daxue"       "qiuji"

How about a table?

Chinese Words	Pinyin	English Translation
爸爸	baba	father
妈妈	mama	mother
圣奥拉夫	shengaolafu	Saint Olaf
老师	laoshi	teacher
学生	xuesheng	student
电脑	diannao	computer
大学	daxue	university
秋季	qiuji	fall (season)
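
A table like this one can be assembled directly in R. The following is a minimal sketch, assuming the tmcn and tibble packages are installed. The object name `pinyin_table`, the column names, and the hand-typed English glosses are illustrative choices, not something tmcn produces.

```r
library(tmcn)
library(tibble)

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")

# English glosses are typed by hand; tmcn does not translate
english <- c("father", "mother", "Saint Olaf", "teacher", "student",
             "computer", "university", "fall (season)")

# One row per word: the original text, its pinyin, and the gloss
pinyin_table <- tibble(
  chinese = chinese_words,
  pinyin  = toPinyin(chinese_words),
  english = english
)

pinyin_table
```

Keeping the words and their pinyin in one tibble makes it easy to pass the result to a table-formatting function later.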

Convert Simplified to Traditional: toTrad()

chinese_words <- c("爸爸", "妈妈", "圣奥拉夫", "老师", "学生", "电脑", "大学", "秋季")
toTrad(chinese_words)

[1] "爸爸"     "媽媽"     "聖奧拉夫" "老師"     "學生"     "電腦"     "大學"    
[8] "秋季"

Simplified	Traditional	English Translation
妈妈	媽媽	mother
电脑	電腦	computer
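
The conversion also runs in the other direction through the `rev` argument of `toTrad()`. Here is a small round-trip sketch, assuming tmcn is installed; the variable names `simplified`, `traditional`, and `back` are only illustrative.

```r
library(tmcn)

simplified <- c("妈妈", "电脑")

# Simplified -> Traditional (the default direction)
traditional <- toTrad(simplified)

# Traditional -> Simplified, using rev = TRUE
back <- toTrad(traditional, rev = TRUE)

# Show the three versions side by side
data.frame(simplified, traditional, back)
```

Comparing `simplified` with `back` is a quick way to check whether a word survives the round trip unchanged.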

Explore GBK Dataset

The GBK dataset provides useful information about each character, such as its pinyin, its radical, and the stroke count of the radical.

library(dplyr)

GBK |>
  as_tibble() |>
  slice_head(n = 3)

# A tibble: 3 × 8
  GBK   py0   py        Radical Stroke_Num_Radical Stroke_Order Structure   Freq
  <chr> <chr> <chr>     <chr>                <int> <chr>        <chr>      <dbl>
1 吖    a     ā yā      口                       3 丨フ一丶ノ丨 左右          26
2 阿    a     ā ɑ ē     阝                       2 フ丨一丨フ一丨…… 左右      526031
3 啊    a     ɑ á à ǎ ā 口                       3 丨フ一フ丨一丨フ一丨…… 左中右     53936

To look up the characters from our word list, split the words into single characters and join them against GBK:

chinese_words_split <- c("爸","爸","妈","妈","圣","奥","拉","夫","老","师","学","生","电","脑","大","学","秋","季")
chinese_words_split |>
  as_tibble() |>
  distinct(value) |>
  inner_join(GBK, join_by(value == GBK)) |>
  slice_head(n = 5) |>
  select(1:5)

# A tibble: 5 × 5
  value py0   py          Radical Stroke_Num_Radical
  <chr> <chr> <chr>       <chr>                <int>
1 爸    ba    bà          父                       4
2 妈    ma    mā          女                       3
3 圣    sheng shènɡ       土                       3
4 奥    ao    ào yù       大                       3
5 拉    la    lá là lǎ lā 扌                       3

In action!

“woman” + “horse” = “mother”
“女” + “马” = “妈”
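
The 女 part of this decomposition can be checked against GBK. A minimal base R sketch, assuming tmcn is installed and that the dataset has the columns shown in the output earlier; `ma_info` is a hypothetical name.

```r
library(tmcn)

# Look up the character 妈 in the GBK dataset shipped with tmcn
ma_info <- GBK[GBK$GBK == "妈", c("GBK", "py0", "Radical", "Stroke_Num_Radical")]

ma_info
```

The `Radical` column holds only the radical, so the second component (马) is not part of this lookup.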

Journey to the West

One of the most famous Chinese novels is “Journey to the West”. Below is an excerpt from chapter 5.

journey_west[[1]][1]

[1] "話表齊天大聖到底是個妖猴，更不知官銜品從，也不較俸祿高低，但只註名便 了。那齊天府下二司仙吏，早晚伏侍，只知日食三餐，夜眠一榻，無事牽縈， 自由自在。閑時節會友遊宮，交朋結義。見三清稱個「老」字，逢四帝道個 「陛下」。與那九曜星、五方將、二十八宿、四大天王、十二元辰、五方五老 、普天星相、河漢群神，俱只以弟兄相待，彼此稱呼。今日東遊，明日西蕩， 雲去雲來，行蹤不定。"

Let’s look at some Chinese stop words from the STOPWORDS dataset:

head(STOPWORDS)

  word
1 第二
2 一番
3 一直
4 一个
5 一些
6 许多

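library(dplyr)
library(tidytext)  # unnest_tokens()

# Split the chapter text (stored in a column named after the chapter title) into words
journey_west_token <- journey_west |>
  unnest_tokens(word, "第五回 亂蟠桃大聖偷丹　反天宮諸神捉怪", token = "words")

journey_dfs <- journey_west_token |>
  # Convert to Simplified first, because STOPWORDS is in Simplified Chinese
  mutate(simplified = toTrad(word, rev = TRUE)) |>
  anti_join(STOPWORDS, by = c("simplified" = "word")) |>
  count(simplified, sort = TRUE) |>
  slice_head(n = 50) |>
  data.frame()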

wordcloud2(
  journey_dfs, 
  size = 1.2, 
  shape = 'cardioid',
  minSize = 15
)