---
title: "regioncode: One-Step Solution for Chinese Region Conversions"
author: 
  - "HU Yue, YE Xinyi"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette:
      keep_md: false
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{regioncode: One-Step Solution for Chinese Region Conversions}
  %\VignetteEngine{knitr::rmarkdown}

bibliography: s_regioncode.bib

editor_options: 
  markdown: 
    wrap: sentence
---

```{r}
#| label: setup
#| include: false

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

if(!require(regioncode)) install.packages("regioncode")
library(regioncode)
library(dplyr)

load("../R/sysdata.rda")
vec_yearRange <- names(region_data) |>
  grep("^\\d{4}_code$", x = _, value = TRUE) |>
  substr(start = 1, stop = 4) |> 
  as.numeric()
```

In China, the term "city" can refer to a county, prefecture, or province. 
This ambiguity creates challenges for researchers, who often struggle to convert regional names and their corresponding geocodes, especially in datasets that span many years. 
The central government periodically changes or eliminates administrative unit names, further complicating these conversions [@GuoJiaTongJiJu2022]. 
Inspired by Vincent Arel-Bundock's `countrycode` package, we developed `regioncode` to streamline the conversion of Chinese region names and codes for the years `r paste(min(vec_yearRange), "--", max(vec_yearRange))`.


# Why `regioncode`?

The Chinese government assigns a unique geocode to each county, prefecture, and province, and regularly adjusts and updates these "administrative division codes" to support national and regional development plans [@MinZhengBu2022]. 
These adjustments, however, challenge researchers who conduct longitudinal studies or merge geospatial data from different years. 
For example, when inconsistencies exist between map data and statistical data, any attempt to render the statistical information on a map of China can produce errors.

`regioncode` provides a one-step solution to these challenges.
The package seamlessly converts formal names, common names, and administrative division codes for Chinese provinces and prefectures over thirty years (`r paste(min(vec_yearRange), "--", max(vec_yearRange))`).
Apart from code conversion, the package also supports matching administrative divisions with their economic and linguistic characteristics.


# Installation

To install:

-  The latest released version: `install.packages("regioncode")`.
-  The latest development version: `remotes::install_github("sammo3182/regioncode")`.

# Basic Usage

We demonstrate the basic functionality of `regioncode` using a sample dataset randomly drawn from @Wang2020c's China Corruption Investigations Dataset. 
The package uses `code` to denote administrative division codes and `name` for the formal names of regions. 
It can convert between these two formats. 

A user simply provides a character vector of names or a numeric vector of geocodes to the function and specifies the desired output with the `convert_to` argument.
The following example converts geocodes from 2019 to their 1989 equivalents. 
Users must set the `year_from` argument to the correct reference year and then use the `year_to` and `convert_to` arguments to specify the target year and output format.

```{r}
#| label: code2code

library(regioncode)

data("corruption")

# Conversion to the 1989 version
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "code", # default setting
  year_from = 2019,
  year_to = 1989
)

# Comparison
tibble(
  code2019 = corruption$prefecture_id,
  code1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "code", # default setting
    year_from = 2019,
    year_to = 1989
  ),
  name2019 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # return names instead of codes
    year_from = 2019,
    year_to = 2019
  ),
  name1989 = regioncode(
    data_input = corruption$prefecture_id,
    convert_to = "name", # return names instead of codes
    year_from = 2019,
    year_to = 1989
  )
)
```

Note that if a region geocoded in 1989 was later absorbed into a new region by 2019, the package uses the new region's geocode. 
If a single large area was later divided into several smaller regions, the package aligns the new codes with the first new region based on the ascending numerical order of their geocodes.^[Users may notice that `regioncode` sometimes outputs provincially-administered counties ("直辖县"). 
We include some of these units to minimize missing data, although they remain county-level administrative units. 
Current resources do not permit coverage of all such counties, an issue we plan to address in the future (GitHub issue \#54). 
We welcome users to contribute a pull request or contact us to help resolve this issue.]

`regioncode` automatically identifies the input format, treating numeric vectors as geocodes and character vectors as names. 
The following example demonstrates converting various input types into different output formats:

```{r}
#| label: code2name

# Original name
tibble(
  id = corruption$prefecture_id,
  name = corruption$prefecture
)

# Codes to name
regioncode(
  data_input = corruption$prefecture_id,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)

# Name to codes of the same year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "code",
  year_from = 2019,
  year_to = 2019
)

# Name to name of a different year
regioncode(
  data_input = corruption$prefecture,
  convert_to = "name",
  year_from = 2019,
  year_to = 1989
)
```
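
In practice, these conversions are often a preparatory step before merging datasets recorded in different years.
The sketch below is one possible workflow, not a prescribed one: it attaches harmonized columns to the sample data with `dplyr::mutate()`, reusing only the `regioncode()` arguments shown above.
The column names `prefecture_id_1989` and `prefecture_1989` are hypothetical, chosen here for illustration.

```{r}
#| label: workflow_harmonize

library(dplyr)
library(regioncode)

data("corruption")

# Add 1989-equivalent codes and names next to the 2019 originals
# (new column names are illustrative, not part of the dataset)
corruption_h <- corruption |>
  mutate(
    prefecture_id_1989 = regioncode(
      data_input = prefecture_id,
      convert_to = "code",
      year_from = 2019,
      year_to = 1989
    ),
    prefecture_1989 = regioncode(
      data_input = prefecture_id,
      convert_to = "name",
      year_from = 2019,
      year_to = 1989
    )
  )

corruption_h
```

The harmonized code column can then serve as the key when joining with another dataset that uses 1989 codes.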

# Advanced Applications

The `regioncode` package also provides specialized functions for more complex data and diverse research needs, including:

1. Conversion from/to incomplete names.
1. Different handling of municipalities.
1. Return of population-based city ranks.
1. Return of pinyin format of outputs.
1. Conversion of provincial data.
1. Return of administrative areas.
1. Return of linguistic zones.

## Incomplete Naming of Prefectures

Datasets often record geographic information using incomplete names that omit the administrative level, such as "北京" for "北京市" or "内蒙" for "内蒙古自治区." 
To handle this type of data, a user can set the `incomplete_name` argument to `TRUE`. 
`regioncode` can perform the conversion as long as at least two characters are available to identify the city or province. 
In the following example, we shorten 70% of the city names in the input vector to demonstrate how `regioncode` resolves this issue:


```{r}
#| label: incomplete_name

# Original full names
corruption$prefecture

fake_incomplete <- corruption$prefecture

index_incomplete <- sample(seq(length(corruption$prefecture)), 7)

fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |>
  substr(start = 1, stop = 2)

fake_incomplete

# Conversion to full names in 2008
regioncode(
  data_input = fake_incomplete,
  convert_to = "name",
  year_from = 2019,
  year_to = 2008,
  incomplete_name = TRUE
)

```
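
To make the idea of matching incomplete names concrete, the following base R sketch shows one way a prefix match could work.
It is an illustration only: `match_prefix()` and the small `full_names` vector are hypothetical and do not describe how `regioncode` is implemented internally.

```{r}
#| label: sketch_prefix

# Hypothetical reference list of full names (not the package's data)
full_names <- c("北京市", "内蒙古自治区", "济南市", "上海市")

# Hypothetical helper: return the full name that starts with each
# short name, or NA when there is no match or the match is ambiguous
match_prefix <- function(short, candidates) {
  vapply(short, function(s) {
    hit <- candidates[startsWith(candidates, s)]
    if (length(hit) == 1) hit else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

match_prefix(c("北京", "内蒙", "济南", "天津"), full_names)
```

In this sketch, "天津" yields `NA` because no entry in the reference list begins with it.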

## Municipalities

In China, municipalities ("直辖市") are geographically cities but function administratively as provinces. 
Different datasets may categorize them differently, with some treating them as equivalent to prefectures. 
The `regioncode` package includes the `zhixiashi` argument to manage this distinction. 
The default setting, `FALSE`, treats municipalities as provinces. 
When set to `TRUE`, the function treats them as prefectures and uses their provincial codes as their geocodes. 
The following example demonstrates this functionality using a character vector containing the names of municipalities, their districts, and a prefecture:

```{r}
#| label: municipality

names_municipality <- c(
  "北京市", # Beijing, a municipality
  "海淀区", # A district of Beijing
  "上海市", # Shanghai, a municipality
  "静安区", # A district of Shanghai
  "济南市"  # Jinan, a prefecture of Shandong
)

# When `zhixiashi` is FALSE, only the districts are recognized
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = FALSE
)

# When `zhixiashi` is TRUE, municipalities are recognized
regioncode(
  data_input = names_municipality,
  year_from = 2019,
  year_to = 2019,
  convert_to = "code",
  zhixiashi = TRUE
)
```
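
For readers unfamiliar with the six-digit administrative division codes, the first two digits identify the province-level unit, so the provincial code of any unit can be derived by zeroing the last four digits.
The sketch below illustrates this arithmetic with a hypothetical helper, `province_code()`, which is not part of `regioncode`; the three input codes are those of 海淀区, 静安区, and 济南市.

```{r}
#| label: sketch_province_code

# Hypothetical helper: keep the two-digit province prefix and
# zero out the prefecture and county digits
province_code <- function(code) (code %/% 10000) * 10000

codes <- c(110108, 310106, 370100)
province_code(codes)
```

The first two results, 110000 and 310000, are the provincial codes of Beijing and Shanghai, which is why a municipality treated as a prefecture carries its provincial code.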

## City Ranking

The *Statistical Yearbook of Urban and Rural Construction* classifies Chinese cities into different tiers based largely on population [@GuoJiaTongJiJu2022a]. 
A four-tier system existed from 1989 to 2014, after which the government expanded it to a seven-tier system, as detailed in the following table:

| Criterion  | Population             | Rank       |
|:-----------|------------------------|------------|
| Old (1989) | > 1 million            | 超大城市   |
|            | 500,000 ~ 1 million    | 大城市     |
|            | 200,000 ~ 500,000      | 中等城市   |
|            | < 200,000              | 小城市     |
|            |                        |            |
| New (2014) | > 10 million           | 超大城市   |
|            | 5 million ~ 10 million | 特大城市   |
|            | 3 million ~ 5 million  | I型大城市  |
|            | 1 million ~ 3 million  | II型大城市 |
|            | 500,000 ~ 1 million    | 中等城市   |
|            | 200,000 ~ 500,000      | I型小城市  |
|            | < 200,000              | II型小城市 |

The `regioncode` function can return the population-based rank of a city for a given year. 
Users can perform this conversion by setting `convert_to = "rank"`. 
The function applies the old four-tier system for years before 2014 and the new seven-tier system from 2014 onward. 
If a city's population data is unavailable in the official sources, the function returns `NA`. 
The following example compares the city ranks generated from the same input vector but for different years:


```{r}
#| label: rank

tibble(
  city = corruption$prefecture,
  rank1989 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "rank"
  ),
  rank2014 = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 2014,
    convert_to = "rank"
  ))
```
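
The population thresholds in the table above can also be expressed directly in base R.
The following sketch uses a hypothetical helper, `rank_city()`, which is not part of `regioncode` and does not use the package's population data.
It assumes that each tier includes its lower bound and excludes its upper bound, which the table itself does not specify.

```{r}
#| label: sketch_rank

# Hypothetical helper: classify population counts into the tiers
# listed in the table, under the old (1989) or new (2014) criterion
rank_city <- function(population, system = c("new", "old")) {
  system <- match.arg(system)
  if (system == "old") {
    breaks <- c(0, 2e5, 5e5, 1e6, Inf)
    labels <- c("小城市", "中等城市", "大城市", "超大城市")
  } else {
    breaks <- c(0, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf)
    labels <- c(
      "II型小城市", "I型小城市", "中等城市",
      "II型大城市", "I型大城市", "特大城市", "超大城市"
    )
  }
  # Assumption: intervals are closed on the left, open on the right
  as.character(cut(population, breaks = breaks, labels = labels, right = FALSE))
}

population <- c(150000, 800000, 4000000, NA)

rank_city(population, system = "old")
rank_city(population, system = "new")
```

A missing population stays `NA` in the output, mirroring the behavior described above for cities without official population data.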

## Pinyin

Pinyin provides a phonetic romanization of Chinese characters, and some datasets store region names in pinyin. 
While `regioncode` defaults to outputting names in Chinese characters, users can obtain pinyin output by setting the `to_pinyin` argument to `TRUE`. 
This functionality is integrated from the `pinyin` package developed by Peng Zhao and Qu Cheng. 
The function also produces the correct romanization for regions with unique spellings, such as Shanxi versus Shaanxi, Inner Mongolia, and the special administrative regions. 
This pinyin conversion works for official names, incomplete names, and administrative area outputs. 
The following example shows how the function works for different requests:

```{r}
#| label: pinyin

tibble(
  city = corruption$prefecture,
  cityPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "name",
    to_pinyin = TRUE
  ),
  areaPY = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    convert_to = "area",
    to_pinyin = TRUE
  )
)

# Regions with special spelling
regioncode(
  data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
  year_from = 2019,
  year_to = 2008,
  convert_to = "name",
  incomplete_name = TRUE,
  province = TRUE,
  to_pinyin = TRUE
)
```

## Provinces

The `regioncode` function also converts data at the provincial level. 
By setting the `province` argument to `TRUE`, users can convert provincial geocodes and names. 
Since Chinese provinces have standard abbreviations, users can convert these abbreviations to other data types by setting the `convert_to` argument to `"abbreTocode"`, `"abbreToname"`, or `"abbreToarea"`. 
To convert names or codes *to* abbreviations, a user can set `convert_to = "abbre"`. 
The following example demonstrates converting a vector of provincial geocodes into their official names and abbreviations:

```{r}
#| label: provinces

tibble(
  province = corruption$province_id,
  prov_name = regioncode(
    data_input = corruption$province_id,
    convert_to = "name",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  ),
  prov_abbre = regioncode(
    data_input = corruption$province_id,
    convert_to = "abbre",
    year_from = 2019,
    year_to = 1989,
    province = TRUE
  )
)
```

## Geographic Units Beyond Provinces

`regioncode` can also convert geographic units beyond the provincial level into two larger categories: administrative areas and linguistic zones.

### Administrative Area

For social, political, and military purposes, China divides its territory into seven major administrative areas [@SunPing2020]:

| Region | Provincial-level Administrative Unit                           |
|:-------|----------------------------------------------------------------|
| 华北   | 北京市, 天津市, 山西省, 河北省, 内蒙古自治区                   |
| 东北   | 黑龙江省, 吉林省, 辽宁省                                       |
| 华东   | 上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省 |
| 华中   | 河南省, 湖北省, 湖南省                                         |
| 华南   | 广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区 |
| 西南   | 重庆市, 四川省, 贵州省, 云南省, 西藏自治区                     |
| 西北   | 陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区       |

Users who need to identify the administrative area for a given prefecture or province can do so with `regioncode`. 
The function converts regional codes and names (for both prefectures and provinces) into their corresponding administrative area by setting the `convert_to` argument to `"area"`:

```{r}
#| label: 2area

regioncode(
  data_input = corruption$prefecture,
  year_from = 2019,
  year_to = 1989,
  convert_to = "area")
```
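
The grouping in the table above amounts to a simple lookup from provincial unit to area.
The base R sketch below rebuilds that lookup from the table for illustration; the objects `areas` and `area_of` are hypothetical names and are not exported by `regioncode`.

```{r}
#| label: sketch_area

# Area membership copied from the table above
areas <- list(
  "华北" = c("北京市", "天津市", "山西省", "河北省", "内蒙古自治区"),
  "东北" = c("黑龙江省", "吉林省", "辽宁省"),
  "华东" = c("上海市", "江苏省", "浙江省", "安徽省", "福建省", "台湾省", "江西省", "山东省"),
  "华中" = c("河南省", "湖北省", "湖南省"),
  "华南" = c("广东省", "海南省", "广西壮族自治区", "香港特别行政区", "澳门特别行政区"),
  "西南" = c("重庆市", "四川省", "贵州省", "云南省", "西藏自治区"),
  "西北" = c("陕西省", "甘肃省", "青海省", "宁夏回族自治区", "新疆维吾尔自治区")
)

# Invert the list into a named vector: province name -> area
area_of <- setNames(
  rep(names(areas), lengths(areas)),
  unlist(areas, use.names = FALSE)
)

unname(area_of[c("山东省", "重庆市", "陕西省")])
```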

### Linguistic Zone

China has numerous dialects, and their geographic distribution presents challenges for relevant economic, political, and sociological analyses. 
A single dialect may span several prefectures or even multiple provinces. 
For political and sociolinguistic researchers, `regioncode` can identify the approximate linguistic zone for a given geocode or prefectural name. 
Following the *Language Atlas of China* (1987), the package provides two levels of linguistic zone identification: dialect groups (`dia_group`, "方言大类") and dialect sub-groups (`dia_sub_group`, "分区片"). 
Note that when `province = TRUE`, conversions are only possible to the dialect group level.

The following example converts the sample data to both dialect groups and sub-groups. 
Because China's linguistic distribution is too complex and dynamic to measure precisely at the prefectural level, the linguistic zone output from `regioncode` should be used for reference only, not for rigorous linguistic research.

```{r}
#| label: language_zone

tibble(
  city = corruption$prefecture,
  dialectGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_group"
  ),
  dialectSubGroup = regioncode(
    data_input = corruption$prefecture,
    year_from = 2019,
    year_to = 1989,
    to_dialect = "dia_sub_group"
  )
)
```


# Conclusion

`regioncode` provides a convenient tool for converting Chinese administrative division codes and official names and for performing other specialized conversions. 
We are actively developing the package and plan to add more administrative levels and richer data in future versions. 
We welcome collaboration, and users can direct any questions, comments, or bug reports to our GitHub Issues page.

We extend our appreciation to LI Ruizhe, ZHU Meng, SHI Yuyang, XU Yujia, PAN Yuxin, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LIU Xueyan for their contributions to data collection and code development for this package.


# References

::: {#refs}
:::


# Affiliation

HU Yue

Department of Political Science,\
Tsinghua University,\
Email: [yuehu\@tsinghua.edu.cn](mailto:yuehu@tsinghua.edu.cn){.email}\
Website: <https://www.drhuyue.site>

YE Xinyi

Department of Political Science,\
Tsinghua University,\
Email: [yexy23\@mails.tsinghua.edu.cn](mailto:yexy23@mails.tsinghua.edu.cn){.email}