
Maximum PC


All times are UTC - 8 hours




 Post subject: Metaprogramming fun in R
PostPosted: Wed May 23, 2012 8:15 pm 
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
Hey everyone. I've done a fair bit of work during the past year using R. Recently, I wrote a small script that analyzes some data from a poker league, and I think it demonstrates a few interesting language features in R that some of you might enjoy. I used the term metaprogramming in the subject line somewhat loosely and probably should have said pseudo-metaprogramming, because the code that I've written doesn't actually create new code. Rather, I've used a combination of the built-in language functions and the runtime environment to save me from writing code that I would have to write in most languages (e.g. Java). Also, I won't be demonstrating any of the built-in statistical features like linear models.

Before we get to the R features, I should start by telling you how I scraped the player data and demonstrate an interesting data structure in R called a data frame. The player data was scraped from the league website using a small Java program, which then writes the data out as R code. The following is an example of the Java output.
Code:
Travis_R<-data.frame(date=c("05/12/12"), time=c("3:30"), location=c("CD's Cattle Co."), place=c(9), players=c(12), points=c(10))
Eric_M<-data.frame(date=c("04/07/12"), time=c("1:00"), location=c("CD's Cattle Co."), place=c(13), players=c(13), points=c(6))

To save space, I've selected two players who only played one session. As you can probably make out, the <- operator assigns a data.frame to a player's name (e.g. Travis_R). As with most statistical analysis, columns indicate the factors or dimensions of the data. Rows represent the instances: in this case, how a player performed at a given date, time, location, etc. In total, fifty-five players played during the previous quarter, each represented by a data frame bearing their name. The data from previous years covers over two hundred players.
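For anyone who hasn't built a data frame by hand, here is a minimal sketch of a player with more than one session. The name Example_P and the second session's values are made up for illustration; only the column layout comes from the Java output above.

```r
# One data frame per player; each vector position is one session (row).
# Example_P and its second session are hypothetical, for illustration only.
Example_P <- data.frame(
  date     = c("05/12/12", "05/05/12"),
  time     = c("3:30", "1:00"),
  location = c("CD's Cattle Co.", "CD's Cattle Co."),
  place    = c(9, 3),
  players  = c(12, 11),
  points   = c(10, 60)
)

nrow(Example_P)   # number of sessions (rows)
names(Example_P)  # column names
str(Example_P)    # column types at a glance
```

Each argument to data.frame() becomes a column, so all the vectors need the same length.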

Here is what the data frame looks like in R.
Code:
> Travis_R
      date time        location place players points
1 05/12/12 3:30 CD's Cattle Co.     9      12     10
> Andrew_H
       date time        location place players points
1  05/12/12 3:30 CD's Cattle Co.    12      12      7
2  05/12/12 1:00 CD's Cattle Co.    10      11      9
3  05/05/12 3:30 CD's Cattle Co.     1      13    100
4  05/05/12 1:00 CD's Cattle Co.     1      11    100
5  04/28/12 1:00 CD's Cattle Co.    14      22      5
6  04/21/12 3:30 CD's Cattle Co.     8      12     15
7  04/21/12 1:00 CD's Cattle Co.     8      11     15
8  04/14/12 3:30 CD's Cattle Co.     1      10    100
9  04/14/12 1:00 CD's Cattle Co.     4      12     50
10 04/07/12 3:30 CD's Cattle Co.     4       9     50
11 04/07/12 1:00 CD's Cattle Co.     8      13     15

People with database experience will probably notice a fair bit of similarity between a data frame and a database table. The data frame also supports features that are similar to SQL queries. I'll show a couple of brief examples; if anyone is interested in discussing this aspect in further detail, we can create another thread.
Code:
> Andrew_H[1:5,]       #RETURNS THE FIRST FIVE ROWS AND ALL COLS
      date time        location place players points
1 05/12/12 3:30 CD's Cattle Co.    12      12      7
2 05/12/12 1:00 CD's Cattle Co.    10      11      9
3 05/05/12 3:30 CD's Cattle Co.     1      13    100
4 05/05/12 1:00 CD's Cattle Co.     1      11    100
5 04/28/12 1:00 CD's Cattle Co.    14      22      5
> Andrew_H[1:5,1:3]           #RETURNS THE FIRST FIVE ROWS, BUT ONLY THE FIRST THREE COLS
      date time        location
1 05/12/12 3:30 CD's Cattle Co.
2 05/12/12 1:00 CD's Cattle Co.
3 05/05/12 3:30 CD's Cattle Co.
4 05/05/12 1:00 CD's Cattle Co.
5 04/28/12 1:00 CD's Cattle Co.
> Andrew_H$place                  #ANOTHER WAY FOR SELECTING A COL
[1] 12 10  1  1 14  8  8  1  4  4  8
> mean(Andrew_H$place)       #MOST OF THE MATH FUNCTIONS OPERATE ON VECTORS OF DATA
[1] 6.454545
> colMeans(Andrew_H[,4:5])    #THE PLACE AND PLAYERS COLS, ALL ROWS. (mean() ON A DATA FRAME NO LONGER WORKS IN R 3.0+)
    place   players
6.454545 12.363636

Notice that most of the built-in R functions work on vectors or matrices of data (similar to Matlab and other math software). This is a great aid in doing mathematical analysis. In many cases, the output is surprisingly complete, saving you the trouble of having to write "reporting" code.
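To make the SQL comparison a little more concrete, here is a rough sketch of filtering rows by a condition and averaging columns. The sessions data frame is a made-up sample in the same shape as the player data, and the SQL in the comments is only an approximate analogy.

```r
# Made-up sample in the same shape as the player data frames above.
sessions <- data.frame(
  date    = c("05/12/12", "05/05/12", "04/28/12", "04/14/12"),
  place   = c(12, 1, 14, 1),
  players = c(12, 13, 22, 10),
  points  = c(7, 100, 5, 100)
)

# Roughly: SELECT date, points FROM sessions WHERE place = 1
wins <- sessions[sessions$place == 1, c("date", "points")]

# subset() expresses the same query with less punctuation.
wins2 <- subset(sessions, place == 1, select = c(date, points))

# Roughly: SELECT AVG(place), AVG(players) FROM sessions
avgs <- colMeans(sessions[, c("place", "players")])
```

The logical vector sessions$place == 1 plays the role of the WHERE clause, and the column names after the comma play the role of the SELECT list.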

Getting to the pseudo-metaprogramming (finally): as I mentioned previously, fifty-five players were involved during this quarter. I really don't want to write some silly code like players<-list(player1, player2, player3, ..., player55). Further, all of us have probably dealt with, or can imagine, data sets that are much larger (e.g. publicly traded stocks, DNA records for individuals, star databases, etc.). Obviously, this type of code can be very tedious. Thankfully, I can use some of the built-in R functions to eliminate most of this work.
Code:
source("Poker League Analysis/5th_street_quarterly_data.R")
players<-ls()

The first line in the code above "sources" the data file, which brings the fifty-five player data frames into the runtime environment. The ls() function in the next line "lists" the names of the variables currently available.
Code:
[1] "Allyn_B"    "Amy_H"      "Andrew_H"   "Ashley_D"   "Barbara_B"  "Brian_W"   
[7] "Canteen_R"  "Chris_D"    "Chris_F"    "Daniel_M"   "Dennis_K"   "Dianna_R" 
[13] "Dick_L"     "Drew_T"     "Elmer_L"    "Eric_M"     "Erwin_B"    "Garrett_F"
...

You'll probably notice that the results from ls() are strings, not the actual data frames. We'll handle this issue in a moment. You might also be thinking to yourself, what if other variables have already been defined? Well, I left off one line of code above, rm(list=ls(all=TRUE)), which runs first and removes all the existing variables. Now that I have a vector of player names (i.e. a one-dimensional array), how do I use this in my code? Let's take a simple example: I want to know how many players (in total, not unique players) have played at CD's versus Plaza Bowl.

Code:
CDs = 0
PB = 0
for (i in 1:length(players)) {
  player<-get(players[i])
  CDs = CDs + length(which(player$location == "CD's Cattle Co."))
  PB = PB + length(which(player$location == "Plaza Bowl"))
}
CDs
PB

When run, this produces:
Code:
> CDs = 0
> PB = 0
> for (i in 1:length(players)) {
+   player<-get(players[i])
+   CDs = CDs + length(which(player$location == "CD's Cattle Co."))
+   PB = PB + length(which(player$location == "Plaza Bowl"))
+ }
> CDs
[1] 146
> PB
[1] 298

As you can see, the get() function performs a kind of reflection: it looks up the object bound to each variable name in the players vector. I then update the variables CDs and PB to count the number of times people have played at each location. A pretty nifty feature that helps eliminate some coding tedium. Has anyone else here used R?
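As a variation on the same idea, here is a sketch that sources the data into its own environment, so nothing has to be removed first, and uses mget() to fetch every data frame in one call. The temporary file and its two tiny data frames are stand-ins for the real data file, and league and count_at are names I made up for this example.

```r
# Hypothetical stand-in for the sourced data file: two tiny player data frames.
data_file <- tempfile(fileext = ".R")
writeLines(c(
  'Travis_R <- data.frame(location = c("CD\'s Cattle Co."), place = c(9))',
  'Eric_M <- data.frame(location = c("CD\'s Cattle Co.", "Plaza Bowl"), place = c(13, 2))'
), data_file)

# Load the data into its own environment instead of the global one.
league <- new.env()
source(data_file, local = league)

# ls() on that environment gives the names; mget() turns them into a named list.
players <- ls(league)
frames <- mget(players, envir = league)

# Count sessions per location without an explicit loop.
count_at <- function(loc) sum(sapply(frames, function(p) sum(p$location == loc)))
CDs <- count_at("CD's Cattle Co.")
PB <- count_at("Plaza Bowl")
```

get() and mget() do the same lookup; mget() just does it for a whole vector of names at once and returns a named list.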


 
 Post subject: Re: Metaprogramming fun in R
PostPosted: Wed May 23, 2012 10:28 pm 
Team Member Top 50

Joined: Sat Jun 25, 2005 11:04 am
Posts: 1026
I've used R before. It's great for statistics. I never knew about the get function, it looks really cool.

An alternative way to do this would be to create a list of the player data frames. This way you'd maintain all the functionality but wouldn't mess the environment up (remove declared variables).

For example:
Code:
> v = list(player1=data.frame(date=c("05/12/12"), time=c("3:30")), player2=data.frame(date=c("06/12/12"), time=c("4:40")))
> v
$player1
      date time
1 05/12/12 3:30

$player2
      date time
1 06/12/12 4:40

# get the names of each player (names(v) is equivalent)
> attributes(v)$names
[1] "player1" "player2"

# get the data frame for each player
> v[[1]]
      date time
1 05/12/12 3:30
>
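Building on the list idea, here is a small sketch of what you can do once the data frames live in a named list: look a player up by name, apply a function to every player, and stack everything into one data frame. The variable names other than v are just for illustration.

```r
# Same two-element list as in the example above.
v <- list(
  player1 = data.frame(date = c("05/12/12"), time = c("3:30")),
  player2 = data.frame(date = c("06/12/12"), time = c("4:40"))
)

# names(v) is shorthand for attributes(v)$names.
player_names <- names(v)

# Look a player up by name rather than by position.
first <- v[["player1"]]

# Apply a function to every data frame; here, count rows per player.
rows_per_player <- sapply(v, nrow)

# Stack all players into one data frame; the row names carry the list names.
all_rows <- do.call(rbind, v)
```

Note the double brackets: v[["player1"]] returns the data frame itself, while v["player1"] returns a one-element list containing it.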


 
 Post subject: Re: Metaprogramming fun in R
PostPosted: Sat May 26, 2012 1:13 am 
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
mag wrote:
I've used R before. It's great for statistics. I never knew about the get function, it looks really cool.

Metaprogramming / reflection can come in really handy. At Boeing, I once replaced a 1400-line if statement with 6 or 7 lines of code using reflection. Somewhat ironically, I also wrote a data structure in Java that was basically a data frame. I really wish that I had known about R at the time!

Did you use R at school or at a job? What type of statistical work were you doing?

mag wrote:
An alternative way to do this would be to create a list of the player data frames. This way you'd maintain all the functionality but wouldn't mess the environment up (remove declared variables).

When people are working in an interactive environment like R or Matlab, they don't mix up their environments. You always start clean. For example, I would never start an instance of R for analyzing a poker data set then switch to another data set like the stock market.

For a small data set, probably less than 500 records, I wouldn't use a list because having to type something like you suggested or...
Code:
players[which(attributes(players)$names == "player1")]

... kind of defeats the point of working in an environment like R. I want to be able to work with the data frames directly (especially if I'm analyzing the data with non-programmers who are going to be like "WTF are you doing?!").

On those occasions when I'm working with a large data set, I like to break the data up into multiple lists based on some type of categorization then use attach() and detach() to bring the data frames in and out of scope.
Code:
> players<-list(player1=data.frame(a,b,c), player2=data.frame(a,b,c))
> attach(players)
> player1
  a  b   c
1 1 10 100
2 2 11 101
3 3 12 102
> player2
  a  b   c
1 1 10 100
2 2 11 101
3 3 12 102
> detach(players)
> player1
Error: object 'player1' not found
>
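Here is a self-contained sketch of the same attach() and detach() round trip, with the columns a, b and c filled in from the printed output above (an assumption on my part about how they were defined). It also shows with(), which gives the same by-name access for a single expression without touching the search path.

```r
# Column values assumed from the printed output above.
players <- list(
  player1 = data.frame(a = 1:3, b = 10:12, c = 100:102),
  player2 = data.frame(a = 1:3, b = 10:12, c = 100:102)
)

attach(players)                 # player1 and player2 are now on the search path
visible <- exists("player1")
detach(players)                 # and now they are out of scope again
gone <- !exists("player1")

# with() gives the same by-name access, scoped to a single expression.
total_b <- with(players, sum(player1$b) + sum(player2$b))
```

One thing to watch for: attach() puts a copy on the search path, so changes made to player1 after attaching are not written back to the players list.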

