[R Project 02-1] Translating a list of all countries and selected cities into English, Korean, and Japanese (Google Translate API)

2018. 4. 6. 17:23

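# 1-1. Loading the file and selecting the fields to use

library(dplyr)  # provides %>%, select(), mutate() and filter() used throughout

country_city01 <- read.csv("world-cities.csv", stringsAsFactors = FALSE)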

head(country_city01)

> country_city01 <- read.csv("world-cities.csv", stringsAsFactors = FALSE)

> head(country_city01)
              name              country         subcountry geonameid
1     les Escaldes              Andorra Escaldes-Engordany   3040051
2 Andorra la Vella              Andorra   Andorra la Vella   3041563
3   Umm al Qaywayn United Arab Emirates     Umm al Qaywayn    290594
4   Ras al-Khaimah United Arab Emirates    Raʼs al Khaymah    291074
5     Khawr Fakkān United Arab Emirates       Ash Shāriqah    291696
6            Dubai United Arab Emirates              Dubai    292223

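Although the title promises every country and city in the world, accurate data was hard to come by, so I used a table from the World DB that MySQL provides. As the output shows, it consists of four fields, including the city name (name) and the country name (country). This project only needs the city and country names, so I store just those two fields separately, as follows.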

country_city02 <- country_city01 %>% select(name, country)

head(country_city02)

> country_city02 <- country_city01 %>% select(name, country)

> head(country_city02)
              name              country
1     les Escaldes              Andorra
2 Andorra la Vella              Andorra
3   Umm al Qaywayn United Arab Emirates
4   Ras al-Khaimah United Arab Emirates
5     Khawr Fakkān United Arab Emirates
6            Dubai United Arab Emirates

# 1-2. Loading the API key for the Google translate API

library(readr)    # read_rds()
library(openssl)  # decrypt_envelope()

out <- read_rds("api_key.enc.rds")

api_key <- decrypt_envelope(out$data, out$iv, out$session, "/.ssh/id_api", password = "") %>%
  unserialize()

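To add the Korean and Japanese fields, I machine-translated the names with the translate API that Google provides. For how to obtain a translate API key, see the post below. In my case the API key had been encrypted in advance, so I first had to load and decrypt it.

(See also) Issuing a Google translate API key for automatic data translation (planned)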
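The post does not show how api_key.enc.rds was created. Below is a minimal sketch of one way such a file could be produced and read back, assuming the openssl package. The throwaway RSA key, the placeholder secret, and the temporary file path are all illustrative assumptions, not the author's actual setup.

```r
library(openssl)

# Throwaway RSA key pair for the demo; a real setup would use a key stored on disk.
key <- rsa_keygen()

# Placeholder secret standing in for a real Google API key.
secret <- "NOT-A-REAL-API-KEY"

# Encrypt: serialize the R object to raw bytes, then envelope-encrypt it
# with the public key. The result is a list with $iv, $session and $data.
envelope <- encrypt_envelope(serialize(secret, NULL), key$pubkey)

# Store the envelope on disk (temporary path, for illustration only).
path <- tempfile(fileext = ".rds")
saveRDS(envelope, path)

# Decrypt: read the envelope back and reverse both steps with the private key.
stored <- readRDS(path)
restored <- unserialize(
  decrypt_envelope(stored$data, stored$iv, stored$session, key = key)
)

identical(restored, secret)
```

Only the encrypted envelope ends up in the .rds file, so the file alone does not reveal the key; whoever holds the private key can recover it.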

# 1-3. Translating the English city names into Korean and Japanese

country_city03 <- translateR::translate(dataset = country_city02,
                                        content.field = 'name',
                                        google.api.key = api_key,
                                        source.lang = 'en',
                                        target.lang = 'ko')

library(stringr)  # str_replace_all()

# Replace leftover Latin letters, digits, non-word characters, and accented
# letters with spaces, then remove all whitespace. Doing this inside mutate()
# keeps the column name city_ko, which step 1-5 refers to.
country_city03 <- country_city03 %>%
  select(city_ko = translatedContent) %>%
  mutate(city_ko = city_ko %>%
           str_replace_all("[a-zA-Z]+", " ") %>%
           str_replace_all("[0-9]+", " ") %>%
           str_replace_all("\\W", " ") %>%
           str_replace_all("[à-źÀ-Ź]+", " ") %>%
           str_replace_all("\\s", ""))

head(country_city03)

> head(country_city03)
       city_ko
1 레에스칼데스
2 안도라라벨라
3 음알케이웨이
4 라스알카이마
5
6       두바이

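This call takes the data frame holding the English city names (dataset) and translates the chosen field (content.field) from English (source.lang) into Korean (target.lang). Because the 'translate' and 'translateR' packages conflict, I named the package explicitly in the code; if you have loaded only 'translateR', you can drop the [translateR::] prefix.

The source data is labeled as English, but it contains a fair number of letters that appear in Finnish, Spanish, and other languages, as well as values that Google does not recognize, so I added a step that removes these bad values. Printing the head of the result shows that most names translated well; a name Google could not handle, such as row 5 (Khawr Fakkān), ends up empty after the cleanup. The Japanese city names and the country names are translated in the same way.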

country_city04 <- translateR::translate(dataset = country_city02,
                                        content.field = 'name',
                                        google.api.key = api_key,
                                        source.lang = 'en',
                                        target.lang = 'ja')

# Same cleanup as for Korean; mutate() keeps the column name city_ja for step 1-5.
country_city04 <- country_city04 %>%
  select(city_ja = translatedContent) %>%
  mutate(city_ja = city_ja %>%
           str_replace_all("[a-zA-Z]+", " ") %>%
           str_replace_all("[0-9]+", " ") %>%
           str_replace_all("\\W", " ") %>%
           str_replace_all("[à-źÀ-Ź]+", " ") %>%
           str_replace_all("\\s", ""))

head(country_city04)

> head(country_city04)

city_ja

2 アンドララベリャ

4 ラスアルカイマ

6 ドバイ
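Since the same five replacements are applied to both the Korean and the Japanese column, they could be wrapped in a small helper. The sketch below assumes the stringr package; `clean_translation()` is a hypothetical name, and the sample values are made up rather than real API output.

```r
library(stringr)

# Hypothetical helper: applies the five replacements from step 1-3 in order.
clean_translation <- function(x) {
  x <- str_replace_all(x, "[a-zA-Z]+", " ")   # untranslated Latin letters
  x <- str_replace_all(x, "[0-9]+", " ")      # digits
  x <- str_replace_all(x, "\\W", " ")         # punctuation and other non-word characters
  x <- str_replace_all(x, "[à-źÀ-Ź]+", " ")   # accented Latin letters
  str_replace_all(x, "\\s", "")               # finally drop all whitespace
}

# Made-up sample values, not real API output.
sample_ko <- c("라스 알 카이마", "Khawr Fakkān", "두바이 2")
cleaned <- clean_translation(sample_ko)
cleaned
```

A value that comes back untranslated, like the second sample, is reduced to an empty string by these rules, which is consistent with the blank row 5 in the Korean output.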

# 1-4. Translating the English country names into Korean and Japanese

country_city05 <- translateR::translate(dataset = country_city02,
                                        content.field = 'country',
                                        google.api.key = api_key,
                                        source.lang = 'en',
                                        target.lang = 'ko')

country_city05 <- country_city05 %>% select(country_ko=translatedContent)

head(country_city05)

> head(country_city05)
     country_ko
1        안도라
2        안도라
3 아랍 에미리트
4 아랍 에미리트
5 아랍 에미리트
6 아랍 에미리트

country_city06 <- translateR::translate(dataset = country_city02,
                                        content.field = 'country',
                                        google.api.key = api_key,
                                        source.lang = 'en',
                                        target.lang = 'ja')

country_city06 <- country_city06 %>% select(country_ja=translatedContent)

head(country_city06)

> head(country_city06)

country_ja

1 アンドラ

2 アンドラ

3 アラブ首長国連邦

4 アラブ首長国連邦

5 アラブ首長国連邦

6 アラブ首長国連邦

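The country names in the source data (English) contained only values that Google Translate could recognize, so all of them translated correctly with no bad values. I therefore skipped the cleanup step for them.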
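Because the country column repeats each name across many rows, one possible variation is to translate every distinct name once and map the results back to the rows. The sketch below illustrates only the mapping step in base R; `fake_translate()` is a hypothetical stand-in for the real API call so that the example runs offline, and it is not what the post itself does.

```r
countries <- c("Andorra", "Andorra", "United Arab Emirates",
               "United Arab Emirates", "United Arab Emirates")

# Hypothetical stand-in for a translation call; it only tags its input.
fake_translate <- function(x) paste0("[ko] ", x)

# Translate each distinct name once and store the result under that name.
distinct_names <- unique(countries)
lookup <- setNames(fake_translate(distinct_names), distinct_names)

# Map every row back to its translation by name.
country_ko <- unname(lookup[countries])
country_ko
```

Here the stand-in is called for two distinct names instead of five rows; with a real API the same lookup idea would reduce the number of strings sent for translation.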

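# 1-5. Final cleanup (removing duplicates) and saving to a file

# Rename name to city and attach the translated columns; the column names
# follow the str() output shown further down.
country_city07 <- country_city02 %>%
  rename(city = name) %>%
  mutate(city_ko = country_city03$city_ko, country_ko = country_city05$country_ko,
         city_ja = country_city04$city_ja, country_ja = country_city06$country_ja)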

table(duplicated(country_city07$city))

> table(duplicated(country_city07$city))

FALSE TRUE

16224 957

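Printing the table of 'duplicated' showed that 957 city names are duplicates. Rows that share a city name still differ in other columns, so 'unique' cannot remove them; instead I use 'filter' with a condition on the TRUE/FALSE output of 'duplicated' to drop the repeats. With this condition, only the first of each set of duplicated values is kept. For more on handling duplicates, see the post below.

(See also) Removing rows with duplicate values from a data frame (planned)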
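The difference between 'unique' and a 'duplicated' condition can be seen on a tiny, made-up data frame. This is an illustrative sketch in base R; the three rows are invented for the example and are not taken from the data set.

```r
cities <- data.frame(
  city    = c("Paris", "Paris", "Dubai"),
  country = c("France", "United States", "United Arab Emirates"),
  stringsAsFactors = FALSE
)

# unique() compares whole rows, so both "Paris" rows survive.
nrow(unique(cities))            # 3

# duplicated() flags the second and later occurrences of a value.
duplicated(cities$city)         # FALSE  TRUE FALSE

# Keeping the rows that are NOT flagged retains the first occurrence only.
deduped <- cities[!duplicated(cities$city), ]
deduped
```

Note that this keeps only one of two different cities that happen to share a name; deduplicating on city and country together would keep both, at the cost of leaving repeated city names in the data.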

# Keep only the first occurrence of each city name
# (the original line referred to an undefined object, test07).
country_city <- country_city07 %>% filter(!duplicated(city))

str(country_city07)

str(country_city)

> str(country_city07)

'data.frame': 17182 obs. of 6 variables:

$ city : chr "les Escaldes" "Andorra la Vella" "Umm al Qaywayn" "Ras al-Khaimah" ...

$ country : chr "Andorra" "Andorra" "United Arab Emirates" "United Arab Emirates" ...

$ city_ko : chr "레에스칼데스" "안도라라벨라" "음알케이웨이" "라스알카이마" ...

$ country_ko: chr "안도라" "안도라" "아랍 에미리트" "아랍 에미리트" ...

$ city_ja : chr NA "アンドララベリャ" NA "ラスアルカイマ" ...

$ country_ja: chr "アンドラ" "アンドラ" "アラブ首長国連邦" "アラブ首長国連邦" ...

> str(country_city)

'data.frame': 16225 obs. of 6 variables:

$ city : chr "les Escaldes" "Andorra la Vella" "Umm al Qaywayn" "Ras al-Khaimah" ...

$ country : chr "Andorra" "Andorra" "United Arab Emirates" "United Arab Emirates" ...

$ city_ko : chr "레에스칼데스" "안도라라벨라" "음알케이웨이" "라스알카이마" ...

$ country_ko: chr "안도라" "안도라" "아랍 에미리트" "아랍 에미리트" ...

$ city_ja : chr NA "アンドララベリャ" NA "ラスアルカイマ" ...

$ country_ja: chr "アンドラ" "アンドラ" "アラブ首長国連邦" "アラブ首長国連邦" ...

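After removing the duplicates, the data went from 17,182 rows to 16,225. To be honest, I had not removed duplicates up front, and that caused problems partway through the later steps. It is no secret that I then struggled to deduplicate belatedly and to realign the original and processed data, and lost a lot of time doing it! Once you think preprocessing is finished, I strongly recommend checking for duplicate data one more time, just in case. (Do your preprocessing properly!)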

write.csv(country_city, "country_city.csv")

head(country_city)

> head(country_city)
              city              country      city_ko    country_ko          city_ja       country_ja
1     les Escaldes              Andorra 레에스칼데스        안도라                          アンドラ
2 Andorra la Vella              Andorra 안도라라벨라        안도라 アンドララベリャ         アンドラ
3   Umm al Qaywayn United Arab Emirates 음알케이웨이 아랍 에미리트                  アラブ首長国連邦
4   Ras al-Khaimah United Arab Emirates 라스알카이마 아랍 에미리트   ラスアルカイマ アラブ首長国連邦
5     Khawr Fakkān United Arab Emirates              아랍 에미리트                  アラブ首長国連邦
6            Dubai United Arab Emirates       두바이 아랍 에미리트           ドバイ アラブ首長国連邦

I combined the four processed datasets and saved the result to a file with 'write.csv'. In the next step, I will use the English city names in this data to obtain latitude and longitude.
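One detail worth knowing about 'write.csv' is that, by default, it also writes the row names as an extra first column. The sketch below shows the effect on a small made-up data frame written to temporary files; it is an illustration, not a change the post itself makes.

```r
df <- data.frame(city = c("Dubai", "Andorra la Vella"),
                 country = c("United Arab Emirates", "Andorra"),
                 stringsAsFactors = FALSE)

path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

write.csv(df, path_default)                    # as in the post: row names are written too
write.csv(df, path_clean, row.names = FALSE)   # omit the row-number column

cols_default <- names(read.csv(path_default))  # "X" "city" "country"
cols_clean   <- names(read.csv(path_clean))    # "city" "country"
```

On systems whose default encoding is not UTF-8, passing fileEncoding = "UTF-8" to 'write.csv' may also be worth considering for the Korean and Japanese columns.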

License: Attribution, NonCommercial, NoDerivatives


나르는 다루루
