Hey everyone. I've done a fair bit of work in R during the past year. Recently I wrote a small script that analyzes data from a poker league, and I think it demonstrates a few interesting language features that some of you might enjoy. I used the term metaprogramming in the subject line somewhat loosely and probably should have said pseudo-metaprogramming, because the code I've written doesn't actually create new code. Rather, I've used a combination of the built-in language functions and the runtime environment to save me from writing code that I would have to write in most languages (e.g., Java). Also, I won't be demonstrating any of the built-in statistical features like linear models.
Before we get to the R features, I should tell you how I scraped the player data and show an interesting R data structure called a data frame. The player data was scraped from the league website by a small Java program, which writes the data out as R code. The following is an example of the Java output.
Code:
Travis_R<-data.frame(date=c("05/12/12"), time=c("3:30"), location=c("CD's Cattle Co."), place=c(9), players=c(12), points=c(10))
Eric_M<-data.frame(date=c("04/07/12"), time=c("1:00"), location=c("CD's Cattle Co."), place=c(13), players=c(13), points=c(6))
To save space, I've selected two players who each played only one session. As you can probably make out, the <- operator assigns a data.frame to a variable named after the player (e.g., Travis_R). As in most statistical analysis, columns hold the variables or dimensions of the data. Rows represent the instances: in this case, how a player performed on a given date, at a given time and location, etc. In total, fifty-five players played during the previous quarter, each represented by a data frame bearing their name. The data from previous years contains over two hundred players.
Here is what the data frame looks like in R.
Code:
> Travis_R
      date time        location place players points
1 05/12/12 3:30 CD's Cattle Co.     9      12     10
> Andrew_H
       date time        location place players points
1  05/12/12 3:30 CD's Cattle Co.    12      12      7
2  05/12/12 1:00 CD's Cattle Co.    10      11      9
3  05/05/12 3:30 CD's Cattle Co.     1      13    100
4  05/05/12 1:00 CD's Cattle Co.     1      11    100
5  04/28/12 1:00 CD's Cattle Co.    14      22      5
6  04/21/12 3:30 CD's Cattle Co.     8      12     15
7  04/21/12 1:00 CD's Cattle Co.     8      11     15
8  04/14/12 3:30 CD's Cattle Co.     1      10    100
9  04/14/12 1:00 CD's Cattle Co.     4      12     50
10 04/07/12 3:30 CD's Cattle Co.     4       9     50
11 04/07/12 1:00 CD's Cattle Co.     8      13     15
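For anyone who wants to poke at a multi-row data frame without the league data, here is a minimal sketch. The name Example_P and its three sessions are made up for illustration; only the column layout follows the data above.

```r
# Hypothetical player with three made-up sessions; the columns mirror the league data.
Example_P <- data.frame(
  date     = c("05/12/12", "05/05/12", "04/28/12"),
  time     = c("3:30", "1:00", "1:00"),
  location = c("CD's Cattle Co.", "Plaza Bowl", "CD's Cattle Co."),
  place    = c(3, 1, 7),
  players  = c(12, 11, 22),
  points   = c(60, 100, 20),
  stringsAsFactors = FALSE  # keep text columns as plain strings on any R version
)

nrow(Example_P)    # number of sessions (rows)
names(Example_P)   # column names
str(Example_P)     # compact summary of each column's type and first few values
```

Each argument to data.frame() is one column, given as a vector, so all columns must have the same length.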
People with database experience will probably notice a fair bit of similarity between a data frame and a database table. The data frame also supports operations that are similar to SQL queries. I'll show a couple of brief examples; if anyone is interested in discussing this aspect in further detail, we can create another thread.
Code:
> Andrew_H[1:5,] #RETURNS THE FIRST FIVE ROWS AND ALL COLS
      date time        location place players points
1 05/12/12 3:30 CD's Cattle Co.    12      12      7
2 05/12/12 1:00 CD's Cattle Co.    10      11      9
3 05/05/12 3:30 CD's Cattle Co.     1      13    100
4 05/05/12 1:00 CD's Cattle Co.     1      11    100
5 04/28/12 1:00 CD's Cattle Co.    14      22      5
> Andrew_H[1:5,1:3] #RETURNS THE FIRST FIVE ROWS, BUT ONLY THE FIRST THREE COLS
      date time        location
1 05/12/12 3:30 CD's Cattle Co.
2 05/12/12 1:00 CD's Cattle Co.
3 05/05/12 3:30 CD's Cattle Co.
4 05/05/12 1:00 CD's Cattle Co.
5 04/28/12 1:00 CD's Cattle Co.
> Andrew_H$place #ANOTHER WAY FOR SELECTING A COL
[1] 12 10 1 1 14 8 8 1 4 4 8
> mean(Andrew_H$place) #MOST OF THE MATH FUNCTIONS OPERATE ON VECTORS OF DATA
[1] 6.454545
> colMeans(Andrew_H[,4:5]) #COLUMN MEANS FOR THE PLACE AND PLAYERS COLS, ALL ROWS.
    place   players
 6.454545 12.363636
Notice that most of the built-in R functions can work on vectors or matrices of data (similar to Matlab and other math software). This is a great aid in doing mathematical analysis. In many cases, the output is surprisingly complete, saving you the trouble of writing "reporting" code.
Getting to the pseudo-metaprogramming (finally): as I mentioned previously, fifty-five players were involved during this quarter. I really don't want to write some silly code like players<-list(player1, player2, player3, ..., player55). Further, all of us have probably dealt with, or can imagine, data sets that are much larger (e.g., publicly traded stocks, DNA records for individuals, star databases, etc.). Obviously, this type of code can be very tedious. Thankfully, I can use some of the built-in R functions to eliminate most of this work.
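Since I mentioned SQL, here is a rough sketch of what WHERE- and ORDER BY-style queries look like on a data frame, along with a bit of vectorized arithmetic. The sessions data frame and its values are invented for the example, and the SQL in the comments is only an approximate translation.

```r
# Made-up sessions, not real league results.
sessions <- data.frame(
  location = c("CD's Cattle Co.", "Plaza Bowl", "Plaza Bowl", "CD's Cattle Co."),
  place    = c(4, 1, 9, 2),
  players  = c(12, 10, 14, 11),
  stringsAsFactors = FALSE
)

# Roughly: SELECT * FROM sessions WHERE place <= 4
top4 <- sessions[sessions$place <= 4, ]

# Roughly: SELECT place, players FROM sessions WHERE location = 'Plaza Bowl'
plaza <- subset(sessions, location == "Plaza Bowl", select = c(place, players))

# Roughly: SELECT * FROM sessions ORDER BY place
ranked <- sessions[order(sessions$place), ]

# Vectorized arithmetic: finishing position as a fraction of the field, one value per row
finish_fraction <- sessions$place / sessions$players

# Column means for several numeric columns at once
col_means <- colMeans(sessions[, c("place", "players")])
```

The comparison sessions$place <= 4 produces a logical vector with one entry per row, and indexing with it keeps only the rows where it is TRUE.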
Code:
source("Poker League Analysis/5th_street_quarterly_data.R")
players<-ls()
The first line in the code above "sources" the data file, which brings the fifty-five player data frames into the runtime environment. The ls() function in the next line "lists" the names of the variables currently defined.
Code:
[1] "Allyn_B" "Amy_H" "Andrew_H" "Ashley_D" "Barbara_B" "Brian_W"
[7] "Canteen_R" "Chris_D" "Chris_F" "Daniel_M" "Dennis_K" "Dianna_R"
[13] "Dick_L" "Drew_T" "Elmer_L" "Eric_M" "Erwin_B" "Garrett_F"
...
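If you want to try the source()/ls() pattern without the league file, here is a small self-contained sketch. The temporary file, the two players (Alice_A, Bob_B), and the separate league environment are all my own stand-ins for illustration; the script above sources straight into the global workspace instead.

```r
# Write two tiny made-up player data frames to a temporary R file,
# standing in for the generated league data file.
data_file <- tempfile(fileext = ".R")
writeLines(c(
  'Alice_A <- data.frame(location = c("Plaza Bowl"), place = c(2))',
  'Bob_B <- data.frame(location = c("Plaza Bowl"), place = c(5))'
), data_file)

# Source the file into its own environment so ls() sees only the sourced objects.
league <- new.env()
source(data_file, local = league)

# ls() returns the variable names as a character vector, sorted alphabetically.
players <- ls(league)
players
```

Using a dedicated environment is just one way to keep unrelated variables out of the listing.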
You'll probably notice that the results from ls() are strings, not the actual data frames. We'll handle this issue in a moment. You might also be thinking to yourself, what if other variables have already been defined? Well, I left off one line of code above, rm(list=ls(all=TRUE)), which goes before the source() call and removes all the existing variables. Now that I have a vector of players (i.e., a one-dimensional array), how do I use this in my code? Let's take a simple example: I want to know how many players (total appearances, not unique players) have played at CD's Cattle Co. versus Plaza Bowl.
Code:
CDs = 0
PB = 0
for (i in 1:length(players)) {
  player<-get(players[i])
  CDs = CDs + length(which(player$location == "CD's Cattle Co."))
  PB = PB + length(which(player$location == "Plaza Bowl"))
}
CDs
PB
When run, this produces:
> CDs = 0
> PB = 0
> for (i in 1:length(players)) {
+   player<-get(players[i])
+   CDs = CDs + length(which(player$location == "CD's Cattle Co."))
+   PB = PB + length(which(player$location == "Plaza Bowl"))
+ }
> CDs
[1] 146
> PB
[1] 298
As you can see, the get() function performs a kind of reflection: given a variable name from the players vector, it returns the data frame bound to that name. I then update the variables CDs and PB to count the number of times people have played at each location. It's a pretty nifty feature that helps eliminate some coding tedium. Has anyone else here used R?
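As a follow-up sketch, the same name-to-object lookup can be done for all the names at once with mget(), which returns a named list of the objects. The two players and their sessions below are made up, and the league environment is my own stand-in for the sourced data; this is a variation on the loop above, not what my script does.

```r
# Two made-up players standing in for the sourced league data.
league <- new.env()
assign("Alice_A",
       data.frame(location = c("Plaza Bowl", "CD's Cattle Co."),
                  stringsAsFactors = FALSE),
       envir = league)
assign("Bob_B",
       data.frame(location = c("Plaza Bowl", "Plaza Bowl"),
                  stringsAsFactors = FALSE),
       envir = league)

players <- ls(league)

# mget() looks up every name in one call and returns a named list of data frames.
frames <- mget(players, envir = league)

# Count sessions per location for each player, then add the counts up.
CDs <- sum(sapply(frames, function(p) sum(p$location == "CD's Cattle Co.")))
PB  <- sum(sapply(frames, function(p) sum(p$location == "Plaza Bowl")))

# Alternatively, stack every player's rows into one data frame and tabulate.
all_sessions <- do.call(rbind, frames)
location_counts <- table(all_sessions$location)
location_counts
```

Summing a logical vector counts its TRUE entries, so sum(x == "...") gives the same count as length(which(x == "...")).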