40. Summary Functions in R
Function Name Description
names() Functions to get or set the names of an object
head(), tail()
Returns the first or last parts of a vector, matrix, table,
data frame or function
str()
Compactly display the internal structure of
an R object
summary() Produce result summaries
dim() Retrieve or set the dimension of an object
length() Get or set the length of vectors
complete.cases()
Return a logical vector indicating which cases are
complete, i.e., have no missing values
as.Data()
Convert between character representations and
objects of class "Date" representing calendar dates
40
EDA - 講解 B-01
41. Visualization Functions in R
Function Name Description
plot() Generic function for plotting of R objects
boxplot() Produce box-and-whisker plot(s) of the given
(grouped) values
hist() Computes a histogram of the given data values
barplot() Creates a bar plot with vertical or horizontal bars
arrows() Draw arrows between pairs of points
abline() a, b: the intercept and slope, single values.
y = [A] + [B]x
lines() Join the corresponding points with line segments.
41
Function name and parameter 的縮寫解釋:
http://jeromyanglim.blogspot.tw/2010/05/abbreviations-of-r-commands-explained.html
EDA - 講解 B-01
42. session_B_eda.R
# load in apple daily article
> d <- read.csv(“df_article.csv”, fileEncoding = “UTF-
8”)
# use dim() to know data frame dimension
> dim(d)
[1] 3784 17
# check the column names
> names(d)
[1] "aid" "case.closed" "circulation"
[4] "date.funded" "date.published" "donation"
[7] "donor" "journalist" "n.fb.comment"
[10] "n.fb.like" "n.fb.share" "n.fb.total"
[13] "n.image" "n.word" "title"
[16] "url.article" "url.detail"
讀入資料與看一看變數
42
EDA - 講解 B-01
43. # use str() to have a brief data summary
> str(d)
利用 str() 迅速了解資料格式
43
EDA - 講解 B-01
90. > library(jiebaR)
# initiate segmentation engine
> cutter = worker(bylines = T)
# cooler way to do segmentation
> article_words = lapply(article_txt, function(x)
cutter <= x)
# traditional way
> article_words = lapply(article_txt, function(x)
segment(x, cutter))
# check if all got segmented
> print(len(article_words))
# adjust to the format for text2vec::itoken
> article_words = lapply(article_words, '[[', 1)
> save(article_words, file =
'data/list_article_words(jieba).RData')
利用 jiebaR 斷詞
90
資料礦工- 講解 C-01
91. 詞的向量化
text2vec
作者 Dmitriy Selivanov 俄羅斯人
支持 the state of art word embeddings (GloVe)
Count-based Model
https://cran.r-project.org/web/packages/text2vec/index.html
word2vec
Tomas Mikolov 領軍的 Google Brain Team 研究團隊
開發
Predictive Model
資料礦工- 講解 C-02
107. # calculate the ratio of word clusters per article
>
計算文章的文字群比例
107
資料礦工- 練習 C-04
108. # check the correlation between word clusters and
# the variables we care
> i <- grep('^k|donation|donor|log|n.fb|ttl',
names(d))
> View(cor(d[,i]))
用相關性來觀察影響
108
資料礦工- 練習 C-04