Sunday, October 28, 2012

Dislike for data frames

Apparently there's something special about October. Every year, around the same time, I seem to have an unfortunate encounter with R and afterwards seek to spread my hard-earned wisdom.

See, reading most tutorials you'd never really know why you'd ever want to use a matrix instead of a data frame, apart from the fact that some functions only take matrices as input. I'm going to tell you a very important reason:

ALL ACCESSES IN DATA FRAMES ARE O(n).

Don't know what that means? Here's what it means: supposing you have 100 rows in a data frame, and you wish to access the last row. No big deal, R finds the row and gives it to you in the blink of an eye.

Now imagine your data frame has 1000 rows.

The same operation is going to take you about 10 times longer.*

It doesn't even matter if you have the rows named. Data frames don't work as hash tables. R is just going to plod along through every single row till it finds a row with your desired name.

$#@%@$@$!!!

Matrices, on the other hand, perform accesses in the proper, well-mannered O(1) time, meaning the time for an access doesn't scale with the size of the matrix. If your R code is taking forever to run, chances are you have a big, obnoxious, dataframe access sitting around somewhere.

You're welcome.

*To be fair, all of this is based on how much slower my code became. Maybe it's not quite O(n), but it's certainly not O(1), and I'm not going to waste time rigorously analysing the asymptotic behaviour of code I am now very upset with