2. About me. Relevant work. Tasks: computer security research, credit risk modeling, pricing strategy, direct marketing. Places: American Express, Johnson and Johnson, DoubleClick, VeriSign, LinkedIn (now).
4. Today’s talk What I wrote: If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up the value in an environment object takes O(1) time on average.
5. Today’s talk What I read after the book was printed
Re: [R] beginner Q: hashtable or dictionary?
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 30 Jan 2006 - 18:37:00 EST
On Sun, 29 Jan 2006, hadley wickham wrote:
>> use a 'list':
>
> Is a list O(1) for setting and getting?
Can you elaborate? R is a vector language, and normally you create a list in one pass, and you can retrieve multiple elements at once.
Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used.
Does the following item from ONEWS answer your question?
Indexing a vector by a character vector was slow if both the vector and index were long (say 10,000). Now hashing is used and the time should be linear in the longer of the lengths (but more memory is used).
Indexing by number is O(1) except where replacement causes the list vector to be copied. There is always the option to use match() to convert to numeric indexing.
-- Brian D. Ripley, Professor of Applied Statistics, University of Oxford
“Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used.” Professor Brian D. Ripley
6. Today’s talk: a short introduction to objects in R; looking up values in R; how lookup tables are implemented in R; measuring lookup speed; optimizing lookup speed.
7. Objects in R Everything in R is an object. Here are some examples of objects. Numeric vector:
> onehalf <- 1/2
> class(onehalf)
[1] "numeric"
8. Objects in R Integer vector:
> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"
9. Objects in R Character vector:
> zero <- "zero"
> class(zero)
[1] "character"
10. Objects in R Logical vector:
> this.is.interesting <- FALSE
> class(this.is.interesting)
[1] "logical"
11. Objects in R Vectors can have multiple elements:
> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)
[1] "numeric"
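The difference between "numeric" and "integer" in the slides above is easy to miss, so here is a small sketch that puts the cases side by side. The variable name `kinds` is only for illustration and is not from the talk.

```r
# Collect the class of several ways of writing "a whole number" in R.
kinds <- c(
  literal  = class(4),                   # a bare literal is stored as a double
  suffixed = class(4L),                  # the L suffix requests an integer
  colon    = class(4:4),                 # a one-element range made with the colon operator
  combined = class(c(6, 7, 8, 9, 10)),   # c() of bare literals stays numeric
  coerced  = class(as.integer(4))        # explicit conversion, as on the slide
)
print(kinds)
```

This matters for the timing tests later in the talk only indirectly, but it explains why `1:5` and `c(1, 2, 3, 4, 5)` report different classes.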
12. Objects in R Lists contain heterogeneous collections of objects:
> stuff <- list(3.14, "hat", FALSE)
> class(stuff)
[1] "list"
13. Objects in R Functions are also objects in R:
> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)
[1] "function"
14. Objects in R Environments map names to objects. They are used within R itself to map variable names to objects. You can access these environment objects, or create your own.
> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one"   "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)
[1] "e"     "one"   "three" "two"
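The slide mentions that you can create your own environments. A minimal sketch of that, using only base R functions; the names `prices`, `joe`, `keys`, and `has_jim` are made up for this example.

```r
# Create a fresh environment to use as a name -> value table.
prices <- new.env()

# Store two values under string labels.
assign("Joe", 15.75, envir = prices)
assign("Bob", 2, envir = prices)

# Fetch a value by label.
joe <- get("Joe", envir = prices)

# List the labels that are currently defined (ls returns them sorted).
keys <- ls(prices)

# Check for a label without raising an error; inherits = FALSE keeps the
# search inside this environment instead of walking up to its parents.
has_jim <- exists("Jim", envir = prices, inherits = FALSE)
```

Unlike a vector, an environment is modified in place, so `assign` does not return a new copy of the table.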
15. Lookups You can look up an item in a vector, list, or array within R. Let’s define a vector:
> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
 [1]  1  2  3  4  5  6  7  8  9 10
You can refer to elements by index:
> a[3]
[1] 3
16. Lookups It's also possible to name elements in a vector, then refer to them by name:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2
This can be very convenient: you can use every vector in R as a table. You can access the name vector through the names function:
> names(b)
[1] "Joe" "Bob" "Jim"
17. Lookups Named vectors in R are implemented using two different arrays: one holding the values (a.20) and one holding the names (names(a.20)).
18. Lookups The name lookup algorithm works roughly like this:
function(vector, name) {
  # scan the names array until a label matches
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name) return(vector[i])
  }
  # no label matched
  return(NA)
}
26. Lookups In vectors, looking up a value by index takes a constant amount of time. Looking up a value by name (potentially) requires looking at every name in the names array. (This means that lookup times scale linearly with the number of items in the table.)
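The Ripley message quoted earlier mentions match() as a way to convert names into numeric indices. A small sketch of that idea, reusing the vector `b` from the earlier slide; `idx`, `vals`, and `missing_idx` are illustrative names only.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Convert labels to positions once...
idx <- match(c("Jim", "Joe"), names(b))

# ...then use the positions for by-index lookups.
vals <- b[idx]

# A label that is not present maps to NA rather than raising an error.
missing_idx <- match("Ann", names(b))
```

If the same labels are looked up repeatedly, doing the name-to-position step once and indexing by number afterwards avoids repeating the name search.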
27. Lookups Environments store (and fetch) data using a different structure. They use hash tables. Hash tables rely on a hash function to map labels to indices.
28. Lookups Simple hash table implementation. Example: store 15 ¾ for “Joe”. Calculate h(“Joe”); here h(“Joe”) = 4. Store 15 ¾ in the table in slot h(“Joe”).
29. Lookups If you carefully choose the size of the hash table and the hash function, you can store and look up values in constant time (on average) in hash tables.
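To make the slot idea concrete, here is a toy hash table written in plain R. It is a sketch for illustration only, not how R implements environments: the hash function `toy_hash` (sum of character codes modulo the table size) and the helpers `make_table`, `put`, and `fetch` are all invented for this example, and collisions are handled by keeping a small named list in each slot.

```r
# Hypothetical hash function: map a label to a slot number in 1..size.
toy_hash <- function(key, size) {
  sum(utf8ToInt(key)) %% size + 1
}

# The table is a list of slots; each used slot holds a named list ("bucket").
make_table <- function(size) vector("list", size)

# Store a value: compute the slot, then add the label to that slot's bucket.
put <- function(tab, key, value) {
  slot <- toy_hash(key, length(tab))
  bucket <- tab[[slot]]
  if (is.null(bucket)) bucket <- list()
  bucket[[key]] <- value
  tab[[slot]] <- bucket
  tab
}

# Fetch a value: compute the slot, then look inside that one bucket only.
# Returns NULL when the label was never stored.
fetch <- function(tab, key) {
  slot <- toy_hash(key, length(tab))
  tab[[slot]][[key]]
}

tab <- make_table(8)
tab <- put(tab, "Joe", 15.75)
tab <- put(tab, "Bob", 2)
joe <- fetch(tab, "Joe")
```

The point of the structure is that a lookup only inspects one slot, however many labels the table holds, as long as the labels are spread evenly across slots.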
30. Measuring Lookup Speed In theory, looking up values in environments should be faster than looking up values in vectors. In practice, how much difference does this make? Let’s measure how much time it takes to look up values in vectors and environments, using different lookup methods
31. Measuring Lookup Speed Let's build a large, labeled vector for testing:
labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    names(a)[i] <- chartr(from, to, i)
  }
  a
}
Here's an example of the output of this function:
> a.20 <- labeled.array(20)
> a.20
 A  B  C  D  E  F  G  H  I AJ AA AB AC AD AE AF AG AH AI BJ
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
32. Measuring Lookup Speed Let's also create environment objects for testing:
labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}
Here’s an example of the output of this function:
> e.20 <- labeled.environment(20)
> e.20
<environment: 0x143756c>
33. Measuring Lookup Speed You can fetch values from an environment object with the get function:
> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20
You can also fetch values from an environment with the double bracket operator:
> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
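Besides get and double brackets, base R offers a couple of other ways to read from an environment. A short sketch on a small hand-built environment (the two labels follow the talk's naming scheme, but the environment `e` here is built directly rather than with the talk's helper function):

```r
e <- new.env()
assign("A", 1L, envir = e)
assign("BJ", 20L, envir = e)

by_get      <- get("A", envir = e)   # function-call form
by_brackets <- e[["BJ"]]             # double-bracket form
by_dollar   <- e$A                   # dollar form, for labels known in advance

# mget fetches several labels in one call and returns a named list.
both <- mget(c("A", "BJ"), envir = e)
```

Note that single brackets are not defined for environments in base R, which is what motivates the operator definitions later in the talk.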
34. Measuring Lookup Speed Creating examples for testing:
arrays <- list()
for (i in 10:15) {
  arrays[[as.character(2 ** i)]] <- labeled.array(2 ** i)
}
environments <- list()
for (i in 10:15) {
  environments[[as.character(2 ** i)]] <- labeled.environment(2 ** i)
}
35. Measuring Lookup Speed Using the test function:
test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)
Output:
first element, by index:
 1024  2048  4096  8192 16384 32768
0.010 0.003 0.004 0.003 0.005 0.004
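For readers who want to try a comparison without the talk's helper functions, here is a self-contained sketch that times repeated by-name lookups of the last element in a named vector and in an environment. The key scheme ("k1", "k2", ...), the table size, and the repetition count are arbitrary choices for this example, and the timings will vary by machine and R version, so no expected numbers are shown.

```r
n <- 2^12
keys <- paste0("k", seq_len(n))   # hypothetical labels, not the talk's scheme

# The same table stored two ways.
vec <- setNames(seq_len(n), keys)
env <- new.env(hash = TRUE, size = n)
for (i in seq_len(n)) assign(keys[i], i, envir = env)

last <- keys[n]   # the label stored at the end of the names array
reps <- 1000

# Time the same by-name lookup against each structure.
t_vec <- system.time(for (r in seq_len(reps)) vec[[last]])[["user.self"]]
t_env <- system.time(for (r in seq_len(reps)) env[[last]])[["user.self"]]

print(c(vector = t_vec, environment = t_env))
```

Changing `n` and rerunning is a quick way to see whether either timing grows with the size of the table.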
40. Optimizing Lookup Speed How to write efficient code: (1) Write code for clarity, not speed. (2) Check to see if the code is fast enough. If it is fast enough, stop. (3) Test your code to find where time is being spent. (4) Fix the parts of your code that are taking too much time. (5) Go to step 2.
41. Optimizing Lookup Speed How do you make lookups fast? Lookups by position are fastest. If you have to look up single values by name, write your code with double brackets; double-bracket lookups are a little faster than single-bracket lookups. If you discover that your code is too slow, you can easily change from vectors to environments.
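Single and double brackets also differ in what they return, which is worth knowing before switching between them. A brief sketch using the vector `b` from the earlier slides; the other variable names are illustrative.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

single  <- b["Bob"]            # a length-one vector that keeps its name
double  <- b[["Bob"]]          # just the value, with the name dropped
several <- b[c("Joe", "Jim")]  # single brackets can fetch many values at once
```

Double brackets always select exactly one element, which is also the form that works on environments, so code written with `[[` is the easiest to move from vectors to environments later.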
42. Optimizing Lookup Speed What if: your code is too slow; you need to look up values by name; and it would be hard to change your code to use double-bracket notation? Define a bracket operator for environments!
43. Optimizing Lookup Speed Remember that every operation in R is a function call, even lookup operators. Example code:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2
44. Optimizing Lookup Speed Translation of the example code:
> b["Bob"]
Bob
  2
> as.list(quote(b["Bob"]))
[[1]]
`[`

[[2]]
b

[[3]]
[1] "Bob"
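The parsed form above suggests that the bracket can be called like any other function. A small sketch that checks this directly; all variable names besides `b` are illustrative.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

via_operator <- b["Bob"]         # the usual notation
via_function <- `[`(b, "Bob")    # the same lookup written as a function call
same <- identical(via_operator, via_function)

# The unevaluated expression is itself an object: a call whose first
# element is the function name and whose remaining elements are arguments.
expr <- quote(b["Bob"])
head_of_call <- as.character(expr[[1]])
result <- eval(expr)
```

Because the operator is an ordinary function name, assigning a new function to that name changes what the bracket notation does.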
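46. Optimizing Lookup Speed Here is the code for our new subset function:
`[` <- function(x, ...) {
  if (inherits(x, "environment")) {
    # environments: treat the first index as a name
    get(x=..1, envir=x)
  } else {
    # everything else: pass all indices (and drop) through unchanged,
    # so that an omitted index is never forwarded as an extra subscript
    .Primitive("[")(x, ...)
  }
}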
47. Optimizing Lookup Speed Assignments through bracket notation are a little funny. For example, R evaluates x[3:5] <- 13:15 as if this code had been executed:
`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
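The expansion on this slide can be checked by doing the assignment both ways and comparing the results. A minimal sketch; `x` and `y` are just two copies of the same starting vector.

```r
x <- 1:10
y <- x

# The usual notation.
x[3:5] <- 13:15

# The same update written as an explicit call to the replacement function.
y <- `[<-`(y, 3:5, value = 13:15)

same <- identical(x, y)
```

The replacement function returns the whole modified object, which is why a replacement function written for environments has to return the environment rather than the assigned value.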
48. Optimizing Lookup Speed Here is the code for our new subset assignment function:
`[<-` <- function(x, ..., value) {
  if (inherits(x, "environment")) {
    assign(x=..1, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    # everything else: pass all indices through unchanged
    .Primitive("[<-")(x, ..., value=value)
  }
}
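A less intrusive variation, which is not the approach taken in the talk, is to leave the global bracket operators alone and attach S3 methods to a class of your own. The sketch below assumes a made-up class name, "envtable", and a made-up constructor, `new_envtable`; it only supports a single character label per lookup.

```r
# Hypothetical constructor: an environment tagged with a custom class.
new_envtable <- function() {
  e <- new.env(hash = TRUE)
  class(e) <- "envtable"
  e
}

# Single-bracket lookup for envtable objects: the index is a label.
`[.envtable` <- function(x, i, ...) {
  get(i, envir = x, inherits = FALSE)
}

# Single-bracket assignment for envtable objects.
`[<-.envtable` <- function(x, i, ..., value) {
  assign(i, value, envir = x)
  x   # replacement functions must return the modified object
}

tab <- new_envtable()
tab["Joe"] <- 15.75
tab["Bob"] <- 2
joe <- tab["Joe"]
```

With this design, ordinary vectors, lists, and data frames keep using the built-in operators, and only objects created by the constructor take the environment code path.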
49. How to reach me twitter: @jadler; http://www.linkedin.com/in/josephadler; baseballhacks@gmail.com
51. A function to test the performance of a lookup function on an object:
test_expressions <- function(description, fun, data, reps) {
  cat(paste(description, "\n"))
  results <- vector()
  # time fun once per data object; each object is named by its size
  for (n in names(data)) {
    results[[n]] <- system.time(
      fun(data[[n]], as.integer(n), reps)
    )[["user.self"]]
  }
  print(results)
}
52. To figure out the full argument list for the bracket operator, use the getGeneric function:
> getGeneric("[")
standardGeneric for "[" defined from package "base"

function (x, i, j, ..., drop = TRUE)
standardGeneric("[", .Primitive("["))
<environment: 0x11a6828>
Methods may be defined for arguments: x, i, j, drop
Use showMethods("[") for currently available ones.
53. In general, you should set new methods with the setMethod function. Example:
setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  })
Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.
Editor's notes
I have about fifteen years of experience in data mining and data analysis. I’ve worked in a variety of industries: financial services, pharmaceuticals, internet companies.
And I’ve written a couple of books on data analysis. Today’s talk isn’t about a subject in either book, but it is inspired by a passage in the second book.
Before I start today’s talk, I want to explain to you why I’m talking about this topic. In my book, one of my chapters is devoted to performance tips. One of my performance tips was about how to quickly look up a value in a table of values.
Then, I was reading through some old comments on R mailing lists and ran into this message. How many people in the room own a copy of this book? <Pick up MASS book> (For those who don’t, how many have used the MASS library?) So, the guy who wrote this email is the guy who wrote this book (and the MASS package). This made me feel really nervous that I had written something incorrect, so I decided to take a closer look at how tables are implemented in R. Today, I’m going to tell you about how lookups in R work, how I tested their performance, and how you can use this information to help you write faster R code.
Today, I’m going to tell you the story of how I tested the performance of different lookup methods in R. I’m going to give a short introduction to different types of objects in R, then explain to you how I tested performance (testing performance used some interesting features in R). Next, I will tell you about the results. And if you’re all still awake, I will tell you how to optimize your program’s performance.
Everything in R is an object. We will start by looking at a few simple data types in R. The data type that you will probably encounter most frequently in R is the numeric vector. Numeric vectors represent numeric values. The class function tells you the class of an object; the class tells R what methods (or functions) can be applied to an object.
Here is another example of a data type in R: integers. Notice that I use the function as.integer to explicitly request an integer. If you were to just type 4, R would return a numeric value.
Here is another important example of an object. Character vectors represent text values. In many other languages, these are called strings.
Another example data type is the logical vector. All of the examples so far have been vectors with one element. But of course, vectors can have multiple elements. Let’s look at a couple of examples.
The colon operator is used to define a sequence of values. When both endpoints are whole numbers, it returns integers. (A trick to return a single integer is to just have a range from one value to itself.) The combine function (“c”) is used to combine a set of values together into a vector.
If you need to represent a heterogeneous collection of objects, you can use a list. A very common type of list is a data frame. Data frames are like database tables (or tables in Excel); they contain multiple columns representing different variables in a data set.
Everything in R is an object. Even functions.
Let’s move on to another important type of object. If you work with R, you have probably used vectors and lists. You have also used environment objects, but you may not have realized it. At any time in R, there is a set of objects that you can access. You may have given these objects names. R represents these relationships as environments. In the example session that I show here, I created three objects, named “one”, “two” and “three”. R stored information mapping these names to these values in an environment called the global environment. I assigned the symbol “e” to point to the global environment (environments are just objects, like everything else in R). Then I showed the class of “e”. I also used the objects function to show the objects defined in this environment. Notice that the objects include one, two, three, and e.
Now, let’s talk about how you look up a value in an object in R. To do this, we’ll define a simple example vector. Here, I defined a vector named “a” with ten values. You can use the bracket operator to refer to a specific location. In this example, I looked up the third item in a, which was the value 3.
(next page shows algorithm) (then walks through example)
As an example, we will show how R looks up the value with the label “F” in the array “a.20”To do this, R iterates through each value in the names array to find the index of the correct value. Then R returns the correct value. <next slide>
R looks up the first item in the names array, which does not match.
Then, R looks up the second item and checks if it matches.
R continues to iterate through the names array until it finds the match.
Ah, found the matching value. The index for the match is 6.
Here is a simple example of how hash tables work. I’m leaving out some important details here. Most importantly, I don’t explain what to do when two labels hash to the same value (this is called a hash collision). Nor do I talk about how you choose the hash table size, or the hash function. A full discussion of hash functions is beyond the scope of this talk. (It’s beyond the scope of most algorithms classes!)
Notice that R doesn’t print out environment objects in a friendly way.
For testing, I generated a set of different arrays and environments with between 1024 and 32768 elements. I generated one object for each power of two. <go to next page>
To test the lookup speed, I wrote a function called test_expressions that would print a message and time how long it took to apply a function to a set of different-sized data objects many times. You can specify the message, the function, the set of data objects, and the number of repetitions (for each object). Notice that this function takes another function as an argument! In the example here, I show how I tested the performance of looking up the first value in each object by index. (I calculated a sum rather than just returning values.)
Here are the results from my tests. How many people think that I should use a chart to present this data? As a show of hands, how many people in this room have read Tufte’s books? How many people raised your hand for both? Seriously, I don’t think that this is enough data to bother plotting. It’s hard to read on the screen (because the type is small), but the trends are so clear that you can see them by just looking at numbers. Let me show you some interesting trends.
First, let’s look at the array lookups by name. Notice that these values increase linearly with the number of elements in the array.
Now, let’s focus on the results for the biggest arrays. <change to next slide>
There are two key takeaways. First, looking up a single value in an array (by index), or an environment (by symbol) is very fast, regardless of table size. Next, notice that lookups by name are much, much slower in arrays. The only exception is looking up the first value in an array by double bracket. Double bracket notation is a little faster. So, what does this mean? <turn to next page>
You could always use environment objects instead of vectors to store tables of values. But I think that will lead you to write more code. You should use whatever method is simplest and easiest to implement your program. When you know that it runs correctly, then you can optimize it. Here is the process that I use to write efficient code.
By the way, even R language expressions are objects in R. That’s how I can show how R parses this expression here.