Sunday, January 25, 2015

Things you should be wary of: switching to on-disk strategies so you can fit multiple jobs in memory...

I recently made it possible to use PyTables with my pylearn2 code, and I was excited because, in addition to letting me work with larger datasets, I thought it would also let me kick off a large number of jobs with medium-sized datasets in parallel. So I had two such jobs running on the server, and then, unrelatedly, I tried doing some line-by-line processing of a large file...and noticed that it was taking FOREVER, and that the bottleneck seemed to be I/O.

To my chagrin, I'd forgotten that when you switch to I/O-heavy jobs, parallel processing becomes limited by how many reads the disk can serve at once rather than by how many cores or how much memory you have...and according to the internets, most hard disks can only have one active read-write head at a time, with all the heads moving together on a single arm. What's worse, if you're parallelising, you're requiring that head to bounce around between very different regions of the disk, so every switch between jobs costs a seek, and that overhead adds up fast. It makes me cringe just thinking about it. [stupid stupid stupid]
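If you want to see the effect for yourself, here is a rough sketch (not what my jobs actually do) that compares reading several files one after another against reading them in alternating chunks, which is roughly what the disk sees when parallel jobs share it. The function names and the 1 MiB chunk size are made up for illustration. It only shows a real difference on a spinning disk with files large enough that the OS page cache can't hide the seeks; on an SSD or with small files the two timings will likely look the same.

```python
import time


def read_sequentially(paths, chunk_size=1 << 20):
    """Read each file start to finish, one after another. Returns bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total


def read_interleaved(paths, chunk_size=1 << 20):
    """Alternate between the files one chunk at a time, imitating parallel
    jobs that share a single disk. Returns bytes read."""
    files = [open(path, "rb") for path in paths]
    total = 0
    try:
        active = list(files)
        while active:
            still_active = []
            for f in active:
                chunk = f.read(chunk_size)
                if chunk:  # an empty read means this file is exhausted
                    total += len(chunk)
                    still_active.append(f)
            active = still_active
    finally:
        for f in files:
            f.close()
    return total


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    import sys

    # Pass two or more large files that live on the same physical disk.
    paths = sys.argv[1:]
    if paths:
        for fn in (read_sequentially, read_interleaved):
            n_bytes, seconds = timed(fn, paths)
            print(f"{fn.__name__}: {n_bytes} bytes in {seconds:.2f}s")
```

Both functions read exactly the same bytes, so any gap between the two timings comes from the access pattern. Keep in mind that the second run may be served from the page cache, so use files bigger than RAM (or drop the cache between runs) for a fair comparison.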
