Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

Data manipulation and visualization in r 20190711 myanmarucsy

Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Cargando en…3
×

Eche un vistazo a continuación

1 de 97 Anuncio

Más Contenido Relacionado

Presentaciones para usted (19)

Similares a Data manipulation and visualization in r 20190711 myanmarucsy (20)

Anuncio

Más reciente (20)

Data manipulation and visualization in r 20190711 myanmarucsy

  1. 1. Data Manipulation and Visualization in R Hyebong Choi School of ICT and Global Entrepreneurship Handong Global University
  2. 2. Big Data: Why it matters 2  In the past, data(record) are only for important people and events  e.g. royal family, war records, …
  3. 3. Big Data: Why it matters 3  In the past, data(record) are expensive and for very important event or a few privileged people  e.g. royal family, war records, …  Now data is for everyone and every moment
  4. 4. Data easily go beyond Human and … 4
  5. 5. Data easily go beyond Human and computer 5 Moore’s Law is a computing term which originated around 1970; the simplified version of this law states that processor speeds, or overall processing power for computers will double every two years.
  6. 6. 6
  7. 7. What we call Big Data 7 3V definition
  8. 8. Data Science  Data Science aims to derive knowledge fr om big data, efficiently and intelligently  Data Science encompasses the set of acti vities, tools, and methods that enable da ta-driven activities in science, business, medicine, and government  Machine Learning(or Data Mining) is agai n one of the core technologies that enabl es Data Science http://www.oreilly.com/data/free/what-is-data-science.csp 8 Data Science is process to get Actionable Insight from Big DATA
  9. 9. Night time Bus by Seoul Local Government  Effective Bus Route Design with Big Data  5 Million Records of Taxi Take on/off  3 Billion Night time call records(location) from telco company
  10. 10. Singapore Traffic visualization  Data from LTA  Subway, bus, taxi traffic info.  By data analytics team in Institute for Infocomm Research
  11. 11. Big data and Data Science
  12. 12. What We Cover Here • Data Manipulation and Visualization in R 12
  13. 13. R • Most common tools for data scientist other than DBMS • Cover wide range of data scientist – Data engineer, Statistician, Data main expert, … rather than just computer engineer • Providing thousands of ready-to-use powerful packages for data scientist • Well documented 13 http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html
  14. 14. R How to Start • Explanation of windows in R studio 14
  15. 15. R • Getting help ▫ help(command) or ?command ▫ example(command) to see examples 15
  16. 16. Package Installation and loading • install.packages(“package name”) • to load package ▫ library(“package name”) ▫ or require(“package name”) 16
  17. 17. Variable • Variable is a container to hold data (or information) that we want to work with • Variable can hold ▫ a single value: 10, 10.5, “abc”, factor, NA, NULL ▫ multiple values: vector, matrix, list ▫ specially formatted data (values): data.frame 17
  18. 18. How you assign a value to variable • var <- value 18 my_first_variable <- 35.121 New variable is now assigned and available in working environment
  19. 19. Operators 19 a <- 10.5 b <- 20 c <- 4 a + b ## addition ## [1] 30.5 a - c ## substraction ## [1] 6.5 a * c ## mulitiplication ## [1] 42 b / c ## division ## [1] 5 a %% c ## remainder ## [1] 2.5 a > b ## inequality ## [1] FALSE a*2 == b ## equality ## [1] FALSE !(a > b) ## negation ## [1] TRUE (b > a) & (b > c) ## logical AND ## [1] TRUE (a > b) | (a > c) ## logical OR ## [1] TRUE
  20. 20. Data Type – Missing Value (NA) • Sometimes values are missing, and R represent the missing values as NAs 20
  21. 21. Vector • A vector is a sequence of data elements of the same basic type. • All members should be of same data type 21 numeric_vector <- c(1, 10, 49) character_vector <- c("a", "b", "c") boolean_vector <- c(TRUE, FALSE, TRUE) typeof(numeric_vector) ## [1] "double" typeof(character_vector) ## [1] "character" typeof(boolean_vector) ## [1] "logical" length(numeric_vector) ## number of members in the vector ## [1] 3 new_vector <- c(numeric_vector, 50) new_vector ## [1] 1 10 49 50
  22. 22. Vector • R’s vector index starts from 1 ▫ 1,2,3,4, … • Minus Index means “except for” 22
  23. 23. Vector with named elements • We can give name to each element of vector • and we can use the name instead of index number 23 some_vector <- c("John Doe", "poker player") names(some_vector) <- c("Name", "Profession") some_vector ## Name Profession ## "John Doe" "poker player" some_vector['Name'] ## Name ## "John Doe" some_vector['Profession'] ## Profession ## "poker player" some_vector[1] ## Name ## "John Doe"
  24. 24. Vector with named elements • We can give name to each element of vector • and we can use the name instead of index number 24 weather_vector <- c("Mon" = "Sunny", "Tues" = "Rainy", "Wed" = "Cloudy", "Thur" = "Foggy", "Fri" = "Sunny", "Sat" = "Sunny", "Sun" = "Cloudy") weather_vector ## Mon Tues Wed Thur Fri Sat Sun ## "Sunny" "Rainy" "Cloudy" "Foggy" "Sunny" "Sunny" "Cloudy“ names(weather_vector) ## [1] "Mon" "Tues" "Wed" "Thur" "Fri" "Sat" "Sun"
  25. 25. Short-cut to make numeric vector 25 a_vector <- 1:10 ## numbers from 1 to 10 b_vector <- seq(1, 10, 2) ## numbers from 1 to 10 increasing by 2 a_vector ## [1] 1 2 3 4 5 6 7 8 9 10 b_vector ## [1] 1 3 5 7 9 c_vector <- rep(1:3, 3) d_vector <- rep(1:3, each = 3) c_vector ## [1] 1 2 3 1 2 3 1 2 3 d_vector ## [1] 1 1 1 2 2 2 3 3 3 c(a_vector, b_vector) ## combine vectors to single vector ## [1] 1 2 3 4 5 6 7 8 9 10 1 3 5 7 9
  26. 26. Basic Vector operations 26 a_vector <- c(1,5,2,7,8) b_vector <- seq(1, 10, 2) sum(a_vector) ## summation ## [1] 23 mean(a_vector) ## average ## [1] 4.6 # operation of Vector and Scala a_vector + 10 ## [1] 11 15 12 17 18 a_vector > 4 ## [1] FALSE TRUE FALSE TRUE TRUE sum(a_vector > 4) ## what does this mean? ## [1] 3 # operation of Vector and Vector a_vector - b_vector ## [1] 0 2 -3 0 -1 a_vector == b_vector ## [1] TRUE FALSE FALSE TRUE FALSE sum(a_vector == b_vector) ## what does this mean? ## [1] 2
  27. 27. Vector Indexing (Selection) 27 sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA) ## vector with some missing values sample_vector[1:5] ## [1] 1 4 NA 2 1 sample_vector[c(1,3,5)] ## [1] 1 NA 1 sample_vector[-1] ## [1] 4 NA 2 1 NA 4 NA sample_vector[c(-1, -3, -5)] ## [1] 4 2 NA 4 NA sample_vector[c(T, T, F, T, F, T, F, T)] ## [1] 1 4 2 NA NA is.na(sample_vector) ## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE sum(is.na(sample_vector)) ## [1] 3 ## can you select non-NA elements from the vector? Selection by numeric vector Selection by logical vector
  28. 28. Data Frame • Very commonly datasets contains variables of different kinds ▫ e.g. student dataset may contain name(character), age(integer), major(factor), gpa(numeric, real number)… • Vector and metric can have values of same data type • A data frame has the variables of a data set as columns and the observations as rows. 28 mtcars ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 ## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
  29. 29. Overviewing of Data frame • head functions shows the first n (6 by default) observation of dataframe • tail functions shows the last n (6 by default) observation of dataframe 29 head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 head(mtcars, 10) ## try to see what happens tail(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2 tail(mtcars, 10) ## try to see what happens
  30. 30. Overviewing of Data frame • str function shows the structure of your data set, it tells you ▫ The total number of observations (e.g. 32 car types) ▫ The total number of variables (e.g. 11 car features) ▫ A full list of the variables names (e.g. mpg, cyl ... ) ▫ The data type of each variable (e.g. num) ▫ The first few observations 30 str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 ... ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ... ## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec: num 16.5 17 18.6 19.4 17 ... ## $ vs : num 0 0 1 1 0 1 0 1 1 1 ... ## $ am : num 1 1 1 0 0 0 0 0 0 0 ... ## $ gear: num 4 4 4 3 3 3 3 4 4 4 ... ## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
  31. 31. Creating Data frame • data.frame function with vectors (of same length and possibly different type) makes you a data frame 31 # Definition of vectors name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) # Create a data frame from the vectors planets_df <- data.frame(name, type, diameter, rotation, rings) planets_df ## name type diameter rotation rings ## 1 Mercury Terrestrial planet 0.382 58.64 FALSE ## 2 Venus Terrestrial planet 0.949 -243.02 FALSE ## 3 Earth Terrestrial planet 1.000 1.00 FALSE ## 4 Mars Terrestrial planet 0.532 1.03 FALSE ## 5 Jupiter Gas giant 11.209 0.41 TRUE ## 6 Saturn Gas giant 9.449 0.43 TRUE ## 7 Uranus Gas giant 4.007 -0.72 TRUE ## 8 Neptune Gas giant 3.883 0.67 TRUE
  32. 32. Creating Data frame • you may specify the variables as parameters 32 my.df <- data.frame(name = c('John', 'Kim', 'Kaith'), job = c('Teacher', 'Policeman', 'Secertary'), age = c(32, 25, 28)) my.df ## name job age ## 1 John Teacher 32 ## 2 Kim Policeman 25 ## 3 Keith Secretary 28
  33. 33. Selection of data frame elements Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. 33 # Print out diameter of Mercury (row 1, column 3) planets_df[1,3] ## [1] 0.382 # Print out data for Mars (entire fourth row) planets_df[4, ] ## name type diameter rotation rings ## 4 Mars Terrestrial planet 0.532 1.03 FALSE # you can use of directly variable name # Select first 5 values of diameter column planets_df[1:5, 'diameter'] ## [1] 0.382 0.949 1.000 0.532 11.209
  34. 34. Selection of data frame elements You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick: planets_df[,3] planets_df[,"diameter"] However, there is a short-cut. If your columns have names, you can use the $ sign: planets_df$diameter 34
  35. 35. Selection of data frame elements - a tricky part • You can use a logical vector to select from data frame 35 ## find planets with rings planets_df[planets_df$rings, ] ## name type diameter rotation rings ## 5 Jupiter Gas giant 11.209 0.41 TRUE ## 6 Saturn Gas giant 9.449 0.43 TRUE ## 7 Uranus Gas giant 4.007 -0.72 TRUE ## 8 Neptune Gas giant 3.883 0.67 TRUE ## select names of planets with rings planets_df[planets_df$rings, 'name'] ## [1] Jupiter Saturn Uranus Neptune ## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus ## find planets with larger diameter than earth planets_df$diameter > 1 ## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE planets_df[planets_df$diameter > 1, ] ## name type diameter rotation rings ## 5 Jupiter Gas giant 11.209 0.41 TRUE ## 6 Saturn Gas giant 9.449 0.43 TRUE ## 7 Uranus Gas giant 4.007 -0.72 TRUE ## 8 Neptune Gas giant 3.883 0.67 TRUE
  36. 36. Handling Big Data using R <2st> Package “dplyr”, essential for data preprocessing in R 36 HyunJae Jo School of ICT and Global Entrepreneurship Handong Global University
  37. 37. Contents • Data preprocessing • dplyr packages • Lab example 37
  38. 38. Why we have to do “Data preprocessing” • Data preprocessing is the process of making raw data suitable for data analysis. 38 • Data Cleaning  Correction of missing value, outlier and noisy data • Data Integration  Combining data from multiple sources • Data Reduction  Reduce only the data needed for analysis • Data Transformation  Data transformation to maximize efficiency of data analytics
  39. 39. Data Example of “Data Integration” 39 Missing value Noisy data Outlier 1. Data Integration  Integrating “Female Population” Column “Data preprocessing”
  40. 40. Example of “Data Cleaning” 40 Missing value Noisy data Outlier “Data preprocessing” 2. Data Cleaning  Correction “Male Pop.” column  Remove missing value(“others” column)  Correction outliers
  41. 41. Example of “Data Reduce” 41 Missing value Noisy data Outlier “Data preprocessing” 3. Data Reduce  Top 5 Population Regions by Total Population
  42. 42. Example of “Data Transformation” 42 Missing value Noisy data Outlier “Data preprocessing” 4. Data Transfromation  Ratio region population to total population
  43. 43. What is “dplyr” • "dplyr" is an R package that specializes in data processing. • Functions of “dplyr” ▫ mutate() adds new variables that are functions of existing variables ▫ select() picks variables based on their names. ▫ filter() picks cases based on their values. ▫ summarise() reduces multiple values down to a single summary. ▫ arrange() changes the ordering of the rows. ▫ group_by() allows to perform any operation “by group” ▫ %>%(chain) connect each operation and perform it at once 43
  44. 44. How we can use “dplyr” • Install packages and load packages. • If you want to see package manual in R, you can use “help” function 44
  45. 45. imdb dataset • IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos and video games, and streaming content online -- including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings. As of October 2018, IMDb has approximately 5.3 million titles (including episodes) and 9.3 million personalities in its database, as well as 83 million registered users. (Wikipedia) • We use reduced imdb dataset and you can load imdb dataset using read.csv() function with URL link: • https://raw.githubusercontent.com/myhan0710/prac_dplyr/master/imdb.csv 45
  46. 46. Overviewing: imdb dataset 46 (e.g. 15,190 movies) (e.g. 44 related variable) (e.g. fn, tid, title,wordsInTitle, …) (e.g. chr, num, int)
  47. 47. Lab Example • We will now look at the functions of the dplyr library through the imdb dataset. • Familiarize yourself with the roles of the given functions, and just enter the code in your RStudio. • Through the Lab, we can finally obtain data from the imdb dataset that meet the following conditions: 47
  48. 48. dplyr: filter 48 • Use filter() function to extract specific row that fits the condition. • First, we can find recent, high rating and • many rating counts movie like this:
  49. 49. dplyr: mutate 49 • Use mutate() function to make new column. • Second, we can make column that indicates whether a movie contains action and adventure genres: • If a certain movie’s Action column value is 1 and Adventure column value is 0, it means that movie includes action genre, but not Advenutre genre. • So, ActionAdven column value 1 means a movie includes both action and adventure genres.
  50. 50. dplyr: select • Use select() function to extract specific column that fits the condition. • We only need a few columns: - wordsInTitle - imdbRating - ratingCount - Year - ActionAven 50
  51. 51. dplyr: arrange 51 • Use the arrange() function to sort data from small to large values based on a specified column. • Fourth, we would like to sort the year in ascending order and then grade in descending order. • desc() function allows it to be sorted in descending order.
  52. 52. dplyr: summarise • Use the summarise() function to obtain basic statistics by specifying functions such as mean(), var(), and median() • Now, we want to know average rating, average year, and number of Action-Adventure genre. 52
  53. 53. dplyr: group_by • Use the group_by() function, then you can group data by level in the specified column. • After group_by() function, You can associate the results with summarise() to view the results of each level. 53
  54. 54. dplyr: %>% • Use the chain function(%>%) to write a code with all functions at once. • Then, the results are same. (imdb_chain, imdb7) 54
  55. 55. All contents 55
  56. 56. Excercise ① Extract movies from the original imdb dataset between 1970 and 2000. ② Using chain function, extract movies that has both drama and family genre. Then, columns are only year, duration, nrOfphotos. • Please replace <Fill In> with your solution. • After replacing your solution, check your solution is right on the next page. 56
  57. 57. Exercise key ① Key ② Key 57
  58. 58. 58
  59. 59. Data Visualization • Essential component of skill set as a data scientist • With ever increasing volume of data, it is impossible to tell stories without visualizations. • Data visualization is an art of how to turn “Big Data” into useful knowledge 59
  60. 60. Data Visualization • Data Visualization 60 Statistics Design Graphical Data Analysis Communication & Perception
  61. 61. Data Visualization • Exploratory Visualization ▫ Help you see what is in the data • Explanatory Visualization ▫ Shows others what you’ve found in your data • R supports both types of visualizations 61
  62. 62. Data Visualization • Exploratory Visualization ▫ Help you see what is in the data ▫ Keep as much as detail as possible ▫ Practical Limit: how much can you see and interpret • Explanatory Visualization ▫ Help us share our understanding with others ▫ Shows others what you’ve found in your data ▫ Requires editorial decisions: ▫ Highlight the key features you want to emphasize ▫ Eliminate extraneous details 62
  63. 63. Exploratory Visualzation 63
  64. 64. Explanatory Visualization 64
  65. 65. ggplot2 • Author: Hadley Wickham • Open Source implementation of the layered grammar of graphics • High-level R package for creating publication-quality statistical graphics ▫ Carefully chosen defaults following basic graphical design rules • Flexible set of components for creating any type of graphics • Things you cannot do With ggplot2 ▫ 3-dimensional graphics ▫ Graph-theory type graphs (nodes/edges layout) 65
  66. 66. ggplot2 installation • In R console: install.packages("ggplot2") library(ggplot2) 66
  67. 67. Toy examples 67
  68. 68. Grammar of Graphics 68
  69. 69. The quick brown fox jumps over the lazy dog 69
  70. 70. Grammar of Graphics 70 • Plotting Framework • Leland Wilkinson, Grammar of Graphics, 1999 • 2 principles ▫ Graphics = distinct layers of grammatical elements ▫ Meaningful plots through aesthetic mapping
  71. 71. Essential Grammatical Elements 71 Element Description Data The dataset being plotted. Aesthetics The scales onto which we map our data. Geometries The visual elements used for our data.
  72. 72. All Grammatical Elements 72 Element Description Data The dataset being plotted. Aesthetics The scales onto which we map our data. Geometries The visual elements used for our data. Facets Plotting small multiples Statistics Representations of our data to aid understanding. Coordinates The space on which the data will be plotted. Themes All non-data ink.
  73. 73. Diagram 73
  74. 74. Grammar of Graphics • Building blocks • Solid, creative, meaningful visualizations • Essential Layers: Data, Aesthetics, Geometries • For Enhancement: Facets, Statistics, Coordinates, Themes 74
  75. 75. Example – First trial of ggplot • To get a first feel for ggplot2, let's try to run some basic ggplot2 commands. Together, they build a plot of the mtcars dataset that contains information about 32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables. 75
  76. 76. Example – First trial of ggplot • The plot from the previous exercise wasn't really satisfying. Although cyl (the number of cylinders) is categorical, it is classified as numeric in mtcars. You'll have to explicitly tell ggplot2 that cyl is a categorical variable. 76
  77. 77. more examples 77
  78. 78. more examples 78
  79. 79. more examples 79
  80. 80. more examples 80
  81. 81. more examples 81
  82. 82. more examples 82
  83. 83. more examples 83
  84. 84. more examples 84
  85. 85. more examples 85
  86. 86. Practice 86 https://gist.githubusercontent.com/tiangechen/b68782efa49a1 6edaf07dc2cdaa855ea/raw/0c794a9717f18b094eabab2cd6a6b9a2269 03577/movies.csv
  87. 87. Practice 87
  88. 88. Practice 88
  89. 89. Practice 89
  90. 90. Practice 90 https://raw.githubusercontent.com/meKIDO/MyanmarData/master/MichelinNY.csv'
  91. 91. Practice 91
  92. 92. Practice 92
  93. 93. Practice 93
  94. 94. Practice 94
  95. 95. Practice 95 https://raw.githubusercontent.com/meKIDO/MyanmarData/master/M yanmarDB.csv
  96. 96. Practice 96
  97. 97. Thank you 97

Notas del editor

  • Hello everyone, my name is Hyunjae Jo. I’m instructor of 2nd lectures in big data course.
    In the previous lecture, you could hear the introduction of big data and basic grammars of R.
    In this lecture, we will learn about data preprocessing and ‘dplyr’ which is the most powerful package for data preprocessing.
  • This is what we are going to do in this lecture.
    First, we will learn what data preprocessing is and why we have to do.
    And you’ll learn 4 types of data preprocessing.
    Next, we’ll learn about the dplyr package.
    You will find out what functions are in the package and what their role is.
    Lastly, we’ll use dplyr package through rap example.
  • Let’s start. What is data preprocessing? Data preprocessing is the process of making raw data suitable for data analysis.
    When you download original data, there may be unnecessary or incorrect contents in the data.
    The raw data itself is inconvenient for you to analyze and can be unnecessarily large in size.
    You should refine these original data into the data that you could use conveniently. And we call this process data preprocessing.
    There are four main types of data pre-processing, and each of which is used when a process is required.
    Data cleaning is a process of correcting incorrect values.
    And Data integration is a process of synthesizing the required data into the existing data.
    Data reduction is a process of removing unnecessary data.
    Lastly, data transformation is a process of converting raw data to the required data form.
  • Let's take a look at it with an example.
    The picture on the left is a sample of data related to Myanmar's population.
    Current data shows the total population and male population by region.
    But as you can see, there are a few wrong numbers or empty items in the data.
    You can see these minus values. We call it outlier when the figures are significantly different from the existing data.
    And others row have no value in each column, we call it missing value.
    Lastly, Male population almost has unnecessary values, and we call it noisy data.
    So, Let’s preprocess this data step by step.
    First, I think we need one more column, Female population by region.
    So, I integrate one more column “Female population”.
    We call this process Data integration.
  • Next, we have to correct wrong values like outliers, missing values, and noisy data.
    As shown in the picture on the right, the wrong values have been changed to the right ones.
    And some unnecessary values were removed. We call this process data cleaning.
  • Now, we want to select a top five areas of the population and analyze this data. In other words, without top five population regions, the other data is not needed. I pick out only the top five population areas, and the rest of the data is removed. This process is called data redirection.
  • Finally, with top 5 population regions data, we want to know ratio of the population by region.
    So, I divide the population of each region by the total population.
    This enabled me to get a percentage of the population from region to region.
    We call this process data transformation.
    Through four preprocesses of data, we could able to obtain the percentage of population in each region
    from the raw data.
  • Yeah, this is data preprocessing. I believe you follow me well.
    Now, we learn about package dplyr, which is most skillful way to preprocess data in R.
    As you can see here, dplyr supports seven major functions. You will actually use the above functions through the back slides.
  • Yesterday, all computer was installed dplyr.
    So, you don’t have to install it. Just load package ‘dplyr’

    And you want some guide about dplyr, you can use help function.
    Give it a try!
  • Let me introduce the imdb dataset that we will use to apply dplyr function. imdb dataset mainly has information related to movies, TV and so on.
    It's a very large dataset, so we're going to do pre-processing with some scaled-down imdb datasets.
    You can load a dataset through read.csv function and its url link. Now get the data on your computer.
    If there's a problem, raise your hand and get some helps.
  • I think most of you have done it, and now let's look at the structure of the dataset.
    You can use str function to look at the structure of the data. Through str function, you can see the number of observations in the data, total number of variables, names of the variables, type of data in the variables, and some observations in each variable.
    Our data has 15190 observations and 44 variables.
  • Now let’s start lab example.
    We will use dplyr function and apply it imdb dataset.
    Familiarize yourself with the function and enter the code on it.
    After the lab, you will know what data preprocessing is.
  • First, you can use filter function.
    Through filter function, you can extract specific rows that fit the conditions.

    So, first preprocessing is to find recent, high rating and many rating counts movies.
    I’ll give you time to write this code, so now, just follow me.
  • Second function is mutate.
    You can make new column through mutate function.
    With this function, you can make a new column that show whether a movie contains both an action and an adventure genre.
    I named the column ActionAdven.
    If ActionAdven column is 1, it contains both genres.
    And ActionAdven column value 0 means it doesn’t contain one or more genre.
  • Third function is select.
    You can extract specific columns that fit the conditions with select function.
    So, we make a dataframe that only has wordsInTitle, imdbRating, ratingCount, year, ActionAdven columns.
    Write above these three codes on your r and Check the result is same like pictures.
  • So, I think most of you have written the select function code.
    Let’s move on the next.
    next function is arrange. With arrange function, we can sort data from small to large value based on specific column.
    And also, you can sort data from large to small value with desc function.
    Now, we would like to sort data by the year.
  • Finally, with most powerful function, you can write code with all functions at once.
    This function name is chain function, and with this function, you can link all the codes you used before.
    Please write down the code below and check the results. If it was written correctly, the results of the imdb_chain and imdb7 would be the same.
  • This is all the code of the lab example we just did.
  • Now let's try to solve the problem.
    There are two problems.
    Read it and fill in the blanks.
    After you solve it, try matching the answer to the next one.
  • I think most of them are done, so I'll finish this selection with this. You learned what data preprocessing is and why we have to do. And we looked at the use of dplyr,
    most powerful data processing package in R.

    I hope what you've learned today will be helpful to you when you're doing an analysis through big data. I'll finish with this. Thank you.

×