This is the project I finished in the NTU 2018 Data Science Programming class, instructed by Pecu Tsai. The main purpose is to practice several concepts, including crawler design, data visualization, text cleaning, data manipulation, and basic text mining.
This project is my attempt to extract data from the PTT KoreaDrama board, focusing on the LIVE articles of “Korean Odyssey,” which is translated as “花遊記” or “和遊記” in Chinese.
There are four parts in this project. First, I write the crawler to collect the articles needed.
Second, I do some descriptive statistics and visualize the output to summarize the data.
Third, I calculate term frequency, both overall and for each article, and then calculate tf-idf.
Fourth, I apply topic modeling to the 22 articles.
Let’s get started!
Preparation: load the packages needed
# load the packages needed
library(tidyverse)
library(tidytext)
library(rvest)
library(lubridate)
library(topicmodels)
library(jiebaR)
# clear the environment and register a Chinese font for ggplot2
rm(list = ls())
windowsFonts(BL = windowsFont("微軟正黑體"))
# keep strings as characters instead of factors
options(stringsAsFactors = F)
#Sys.setlocale(locale = "UTF-8")
# clear previous warnings
assign("last.warning", NULL, envir = baseenv())
Article searching: Find the articles titled “[LIVE] 花遊記” or “[LIVE] 和遊記”.
part1 = "https://www.ptt.cc/bbs/KoreaDrama/index"
part2 = ".html"
# to crawl the articles needed
abc <- paste0(part1, 1950:2025 , part2) %>%
map(read_html)
# to get all titles
abc_title <- abc %>%
map(html_nodes,css= ".title a") %>%
map(html_text)
# to check which article has the needed title
check <- str_detect(unlist(abc_title), pattern = "\\[LIVE\\] tvN 和遊記")
sum(check)
## [1] 22
There are 22 articles in total. To get the needed URLs, I use the check index to subset the links.
abc_link <- abc %>%
lapply(html_nodes , css= ".title a") %>%
lapply(html_attr, "href") %>%
unlist()
urls <- abc_link[check]
article_links <- paste0("https://www.ptt.cc", urls)
Crawler creating: Write a PTT crawler to get the articles.
ptt_article_crawler <- function(url) {
#main content
main_author = ".article-metaline:nth-child(1) .article-meta-value"
main_title = ".article-metaline-right+ .article-metaline .article-meta-value"
main_datetime = ".article-metaline+ .article-metaline .article-meta-value"
main_content = "#main-content"
main_link = ".f2 a"
#push content
push_tag = ".push-tag"
push_id = ".push-userid"
push_text = ".push-content"
push_datetime = ".push-ipdatetime"
# css argument
main_column = c(main_author, main_title, main_datetime, main_content, main_link)
push_column = c(push_tag, push_id, push_text, push_datetime)
# main and push as two empty lists
main = list()
push = list()
raw_html <- url %>% read_html()
# for loop
for(i in 1:length(main_column)){
content <- raw_html %>% html_nodes(css = main_column[i]) %>% html_text()
main[[i]] <- content
}
main <- as.data.frame(matrix(unlist(main), nrow = 1))
colnames(main) <- c("main_author", "main_title", "main_datetime", "main_content", "main_link")
for(i in 1:length(push_column)){
content <- raw_html %>% html_nodes(css = push_column[i]) %>% html_text()
push[[i]] <- content
}
push <- as.data.frame(push)
colnames(push) <- c("push_tag", "push_id", "push_text", "push_datetime")
# clean
main$main_content <- main$main_content %>%
str_replace_all("\n", replacement = "") %>% # remove line breaks
str_replace_all("--※.+", replacement = "") %>% # trim the signature tail
str_replace_all("作者.+:[0-9]{2}\\s[0-9]{4}", replacement = "") %>% # remove the author/header line
str_replace_all("-----Sent from.+", replacement = "") # remove "Sent from" lines
main$weekday <- main$main_datetime %>% str_sub(1,4)
main$datetime <- main$main_datetime %>% str_sub(5,-1) %>% parse_datetime("%b %d %H:%M:%S %Y")
# main$authorip <- main$authorip %>% str_sub(str_locate(., "來")[1] + 4 , -2)
#######
push$push_text <- push$push_text %>% str_replace(pattern = ":", replacement = "") %>% str_replace_all(pattern = "\\s", replacement = "")
# the combined push timestamp looks like 2018/04/0100:42 after removing whitespace
push$push_datetime <- toString(year(main$datetime)) %>% str_c(push$push_datetime, sep = "/" ) %>% str_replace_all(" ", "")
push$date <- push$push_datetime %>% str_sub(1,10) %>% parse_date("%Y/%m/%d", locale = locale(tz = "Asia/Taipei"))
push$time <- push$push_datetime %>% str_sub(11,-1) %>% parse_time("%H:%M")
push$weekday <- wday(push$date, label = T)
push$month <- month(push$date)
push$day <- day(push$date)
push$hour <- hour(push$time)
push$minute <- minute(push$time)
#######
article <- list("main" = main, "push" = push)
return(article)
}
# use the function
article_Korean_Odyssey = list()
for(i in 1:length(article_links)) {
article_Korean_Odyssey[[i]] = ptt_article_crawler(article_links[i])
}
names(article_Korean_Odyssey)= str_c("article", c(1:22))
This crawler has three parts. The first is to list all of the CSS selectors. The second is to use for loops to crawl the main article and the push content. The third is to clean the text with the stringr package and extract the date and time details with the lubridate package.
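To make the cleaning step concrete, here is a minimal sketch of the same stringr/lubridate logic applied to a made-up PTT timestamp (the string below is a hypothetical example, not crawled data; it relies on the packages loaded in the Preparation step).
sample_main_datetime <- "Sat Mar 10 19:40:12 2018"   # hypothetical example value
weekday <- sample_main_datetime %>% str_sub(1, 4)    # "Sat "
datetime <- sample_main_datetime %>%
str_sub(5, -1) %>%                                   # drop the weekday prefix
parse_datetime("%b %d %H:%M:%S %Y")                  # same format string as in the crawler
year(datetime)                                       # 2018, later prepended to the push timestamps
wday(datetime, label = TRUE)                         # lubridate recovers the weekday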
Push text extracting: Extract tweets and add article ID.
article_Korean_Odyssey_push = tibble()
for(i in 1:22) {
article_Korean_Odyssey_push <- article_Korean_Odyssey[[i]]$push %>%
mutate(article = i) %>% rbind(article_Korean_Odyssey_push)
}
Originally I have both main content and tweet (push) content. Since the main content of a live-streaming article is just an introduction to the drama, there is no need to preserve it, so here I only analyze the tweet (push) content. However, it is still important to crawl the main content, because for other article types, such as drama reviews, the main content matters. The crawler is therefore designed to preserve it for further use.
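For completeness, here is a hedged sketch of how the main contents could be kept for later analysis (main_Korean_Odyssey is a name I introduce here; it is not used elsewhere in this project).
# bind the 22 one-row main data frames into a single tibble, keeping the list
# names ("article1" ... "article22") in an id column for reference
main_Korean_Odyssey <- article_Korean_Odyssey %>%
map_dfr(~ .x$main, .id = "article") %>%
as_tibble()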
Descriptive statistics: Visualize the time distribution of tweets in each article.
article_descriptive <- select(article_Korean_Odyssey_push, article, date, time, month, day, hour, minute)
article_descriptive <- as_tibble(article_descriptive)
article_descriptive
## # A tibble: 11,563 x 7
## article date time month day hour minute
## <int> <date> <time> <dbl> <int> <int> <int>
## 1 22 2018-03-04 19:44 3.00 4 19 44
## 2 22 2018-03-04 19:44 3.00 4 19 44
## 3 22 2018-03-04 19:45 3.00 4 19 45
## 4 22 2018-03-04 19:46 3.00 4 19 46
## 5 22 2018-03-04 19:46 3.00 4 19 46
## 6 22 2018-03-04 19:47 3.00 4 19 47
## 7 22 2018-03-04 19:48 3.00 4 19 48
## 8 22 2018-03-04 19:49 3.00 4 19 49
## 9 22 2018-03-04 19:49 3.00 4 19 49
## 10 22 2018-03-04 19:50 3.00 4 19 50
## # ... with 11,553 more rows
There are combined date and time columns as well as individual columns. The former are easier to read, and the latter are for manipulating the data, such as cutting it into intervals or filtering and mutating. Now I am going to visualize the time distribution of tweets in each article.
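As a side note, since the combined date and time columns are already there, the binning could also be done directly with lubridate::floor_date instead of deriving hour and minute groups by hand as in the loop below; a minimal sketch (the datetime and interval column names are mine):
article_Korean_Odyssey_push %>%
mutate(datetime = ymd_hms(paste(date, time)),           # combine the date and time columns
interval = floor_date(datetime, "10 minutes")) %>%      # cut into 10-minute bins
count(article, interval)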
for(i in 1:22) {
print(article_descriptive %>%
filter(article == i ) %>%
filter(date == date[1]) %>%
mutate(mingroup = factor(floor(minute/10)), hourgroup = factor(hour, ordered = T)) %>%
select(-date, -time) %>%
group_by(month, day, hourgroup, mingroup) %>%
count() %>%
ungroup %>%
mutate(minute = as.numeric(as.character(mingroup)) * 10, hour = hourgroup) %>% # convert the factor label back to minutes (0, 10, ..., 50)
unite("HM", c("month", "day", "hour", "minute"), sep = ":") %>%
ggplot() +
geom_col(aes(HM, n, fill = hourgroup), show.legend = FALSE) + geom_line(aes(HM, n, group = 1), alpha = 1, col = "blue") +
labs(x = "時間", y = "出現次數", title = paste0("episode", i)) +
theme(plot.title = element_text(family = "BL", color = "black"), axis.title.y = element_text(angle = 90, family = "BL", color = "black"),
axis.text.x = element_text(angle = 60, family = "BL", hjust = 1, color = "black"),
axis.title.x = element_text(family = "BL", color = "black", face = "bold"))
)
}
Word segmenting: Use the jiebaR package to segment the words.
# select push_text
article_tidytext <- select(article_Korean_Odyssey_push, push_text, article)
article_tidytext <- article_tidytext %>% as_tibble()
# create dictionary
KOword <- c("花遊記", "和遊記", "悟空", "孫悟空", "唐三藏", "昇基", "三藏", "漣序", "八戒", "牛魔王", "魔王", "黑龍", "503","阿斯女", "金剛箍", "齊天大聖", "六空", "韓周", "悟淨", "死鈴", "愛鈴")
#length(KOword)
KOword_tag <- rep("n", 21)
# set the worker with the stop-word list (creating a second worker later would
# silently discard the user dictionary, so everything is set on one worker)
cutting = worker(stop_word = "stopwords_chi_updated.csv")
# add the custom dictionary to the worker
new_user_word(cutting, KOword, KOword_tag)
## [1] TRUE
# segment the tweets
tidy_KO_jiebar2 <- list()
for(i in 1:22) {
article_i <- article_tidytext %>% filter(article == i)
tidy_KO_jiebar2[[i]] <- segment(article_i$push_text, cutting)
}
# convert to tibble
tidy_KO_jiebar2_tibble <- tibble()
for(i in 1:22) {
gawy <- as_tibble(tidy_KO_jiebar2[[i]]) %>% mutate(article = i)
tidy_KO_jiebar2_tibble <- rbind(tidy_KO_jiebar2_tibble, gawy)
}
A tibble is a revised version of data.frame, provided by the tibble package under the tidyverse. I am accustomed to manipulating tibbles, so I usually convert data to the tibble type.
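A tiny illustration of the difference (the toy data.frame below is made up): a tibble prints its dimensions and column types and only the first rows, which is handy for large tables like the segmented corpus.
toy <- data.frame(word = c("悟空", "三藏", "魔王"), n = c(3, 2, 1))
toy              # plain data.frame printing
as_tibble(toy)   # tibble printing with dimensions and column types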
Anti join: Anti-join the stop words to remove unwanted words.
# there are still stopwords so anti_join
names(tidy_KO_jiebar2_tibble) <- c("word", "article")
stop_words_chinese <- read_csv("stopwords_chi_updated.csv", col_names = F)
names(stop_words_chinese) <- c("number", "word")
tidy_KO_jtu <- tidy_KO_jiebar2_tibble %>%
anti_join(stop_words_chinese, by = "word")
stupid <- tibble(
word = c("XD", "推", "會", "應該", "覺得", "XDDD", "喔", "XDD", "QQ", "哈哈", "哈哈哈", "沒", "說"),
number = seq(747, 759, 1)
)
stop_words_chinese2 <- rbind(stop_words_chinese, stupid)
Term frequency: Get the high frequency terms overall.
tidy_KO_jtu %>%
anti_join(stop_words_chinese2, by = "word") %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
count(word, sort = T) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
top_n(30)
## # A tibble: 31 x 2
## word n
## <fct> <int>
## 1 悟空 1275
## 2 三藏 904
## 3 真的 755
## 4 阿斯 547
## 5 魔王 514
## 6 女 467
## 7 50 425
## 8 編劇 296
## 9 想 275
## 10 不會 267
## # ... with 21 more rows
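Note that top_n() keeps ties, which is why the output above has 31 rows instead of 30. A hedged alternative with a newer dplyr verb drops the ties explicitly (this assumes dplyr >= 1.0, which provides slice_max()):
tidy_KO_jtu %>%
anti_join(stop_words_chinese2, by = "word") %>%
count(word, sort = T) %>%
slice_max(n, n = 30, with_ties = FALSE)   # exactly 30 rows, ties broken by row order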
Term frequency vis: Visualize the high frequency terms overall.
tidy_KO_jtu_vis <- tidy_KO_jtu %>%
anti_join(stop_words_chinese2, by = "word") %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
count(word, sort = T) %>%
mutate(word = reorder(word, n)) %>%
#arrange(desc(n)) %>%
ungroup() %>%
top_n(30) %>%
ggplot(aes(x = word, y = n)) +
geom_col(show.legend = FALSE) +
ggtitle("整體高頻詞彙") +
labs(x = "高頻詞", y = "出現次數")
windowsFonts(BL = windowsFont("微軟正黑體"))
tidy_KO_jtu_vis + theme(plot.title = element_text(family = "BL", color = "black"), axis.title.y = element_text(angle = 90, family = "BL", color = "black"),
axis.text.x = element_text(angle = 60, family = "BL", hjust = 1, color = "black"),
axis.title.x = element_text(family = "BL", color = "black", face = "bold"))
After visualizing the whole corpus, it’s time to look at each article.
Term frequency: Get the high frequency terms for each article.
tidy_KO_jtu_eachterm <- tidy_KO_jtu %>%
anti_join(stop_words_chinese2, by = c("word")) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
count(article, word, sort = T) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
group_by(article) %>%
top_n(5) %>%
arrange(article)
Term frequency vis: Visualize the high frequency terms for each article.
for(i in 1:22) {
print(
tidy_KO_jtu_eachterm %>%
filter(article == i) %>%
ggplot() +
geom_col(aes(word, n), fill = "white", colour = "red", show.legend = F) +
labs(title = paste0("episode", i),
x ="高頻詞", y = "出現次數") +
# the theme must be added before print(), otherwise it is never applied to the plot
theme(plot.title = element_text(family = "BL", color = "black"), axis.title.y = element_text(angle = 90, family = "BL", color = "black"),
axis.text.x = element_text(angle = 60, family = "BL", hjust = 1, color = "black"),
axis.title.x = element_text(family = "BL", color = "black", face = "bold"))
)
}
Tf-idf Calculation: Calculate tf-idf.
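As a quick reminder of what bind_tf_idf() computes (with a natural-log idf and N = 22 articles here; the notation is mine):

$$ \text{tf-idf}(w, d) = \underbrace{\frac{n_{w,d}}{\sum_{w'} n_{w',d}}}_{\text{tf}} \times \underbrace{\ln \frac{N}{\lvert \{ d' : w \in d' \} \rvert}}_{\text{idf}} $$

so a word that appears in only one article gets idf = ln(22) ≈ 3.09, which matches the values in the output below.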
tidy_KO_jiebar2_count <- tidy_KO_jiebar2_tibble %>% group_by(article) %>%
count(article, word, sort = T) %>%
ungroup()
tidy_KO_tfidf <- tidy_KO_jiebar2_count %>%
bind_tf_idf(word, article, n)
# preserve the top 30
tidy_KO_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
mutate(word = reorder(word, tf_idf)) %>%
top_n(30) %>%
ungroup
## # A tibble: 30 x 6
## article word n tf idf tf_idf
## <int> <fct> <int> <dbl> <dbl> <dbl>
## 1 3 NG 11 0.00842 3.09 0.0260
## 2 6 生日快樂 7 0.00511 3.09 0.0158
## 3 5 情痴 8 0.00509 3.09 0.0157
## 4 4 一切順利 8 0.00485 3.09 0.0150
## 5 14 阿斯 77 0.0209 0.693 0.0145
## 6 16 阿斯 64 0.0197 0.693 0.0137
## 7 21 冥界 56 0.00559 2.40 0.0134
## 8 7 吳海英 11 0.00537 2.40 0.0129
## 9 2 事故 16 0.00748 1.70 0.0128
## 10 3 修正版 5 0.00383 3.09 0.0118
## # ... with 20 more rows
Tf-idf Visualization: Visualize tf-idf.
tidy_KO_tfidf_vis <- tidy_KO_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
top_n(30) %>%
ungroup %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(x = word, y = tf_idf)) +
geom_col(show.legend = FALSE, fill = "white", col = "red") +
ggtitle("整體高tf_idf詞彙") +
labs(x = "高tf_idf詞", y = "tf_idf")
tidy_KO_tfidf_vis + theme(axis.title.y = element_text(angle = 90, family = "BL", color = "black"),
axis.text.x = element_text(angle = 60, family = "BL", hjust = 1, color = "black"),
axis.title.x = element_text(family = "BL", color = "black", face = "bold"))
Topic modeling trial: Use the topicmodels package to divide the corpus into two topics.
test <- tidy_KO_jtu %>%
anti_join(stop_words_chinese2, by = c("word")) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
count(article, word, sort = T)
# from the tidy format to a DocumentTermMatrix
gawy = test %>%
cast_dtm(article, word, n)
# Hypothesize that there are two topics
gawy_lda <- LDA(gawy, k = 2, control = list(seed = 1234))
gawy_topics <- tidy(gawy_lda, matrix = "beta")
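Since k = 2 is only a hypothesis, one optional, hedged sanity check is to refit the model for a few values of k and compare their perplexity with topicmodels::perplexity() (lower is better; the ks vector is my own choice and this is exploratory only, not part of the original analysis).
ks <- c(2, 3, 4, 6)
# refit the VEM model for each candidate k and record its perplexity on the corpus
perp <- sapply(ks, function(k) perplexity(LDA(gawy, k = k, control = list(seed = 1234))))
tibble(k = ks, perplexity = perp)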
Topic modeling vis: Compare topic 1 and topic 2.
# preserve top 10
gawy_top_terms <- gawy_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# data vis
gawy_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
It is quite clear that there are no real differences between the two topics. We can say that the tweets of the 22 articles do not differ much from one another.
Topic modeling check: Double-check whether there are differences between the two topics.
gawy_documents <- tidy(gawy_lda, matrix = "gamma")
gawy_documents %>%
filter(gamma >= 0.1)
## # A tibble: 23 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 21 1 1.000
## 2 22 1 1.000
## 3 9 1 1.000
## 4 11 1 1.000
## 5 10 1 0.267
## 6 8 1 1.000
## 7 7 1 1.000
## 8 6 1 1.000
## 9 4 1 1.000
## 10 5 1 1.000
## # ... with 13 more rows
It is also clear that almost every article contains only one topic. The hypothesis is rejected.
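A further hedged check, in the style of the tidytext workflow, is to compare the per-topic word probabilities directly: spread beta by topic and look at the log2 ratio between the two topics (beta_spread and the 0.001 threshold are my own choices for illustration).
beta_spread <- gawy_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > 0.001 | topic2 > 0.001) %>%
mutate(log_ratio = log2(topic2 / topic1))
# words with the largest absolute log ratio differ most between the two topics
beta_spread %>% arrange(desc(abs(log_ratio))) %>% head(10)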
Conclusion
In this project, I applied the concepts I learned and tried many useful packages such as dplyr, tidyr, and stringr. There are still some problems I could not deal with. For example, I spent much time setting ggplot2 arguments due to my unfamiliarity with the package. I am looking forward to the next project!