Itinerant Bioinformaticist: Things you should be wary of: switching to on-disk strategies so you can fit multiple jobs in memory...

Sunday, January 25, 2015


I recently made it possible to use PyTables with my pylearn2 code, and I was excited because, in addition to letting me work with larger datasets, I thought it would also let me kick off a large number of jobs with medium-sized datasets in parallel. So I had two such jobs running on the server, and then, unrelatedly, I tried doing some line-by-line processing of a large file...and noticed that it was taking FOREVER, and that the bottleneck seemed to be I/O. To my chagrin, I'd forgotten that when you switch to I/O-heavy jobs, parallel processing becomes limited by the number of read-write heads you have...and according to the internets, most hard disks can only have one active head at a time. What's worse, if you're parallelising, you're requiring that head to bounce around between very different regions of the disk, incurring significant seek overhead. It makes me cringe just thinking about it. [stupid stupid stupid]
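For what it's worth, one way to stop parallel jobs from fighting over the disk is to make them take turns for the I/O-heavy bits and to read in big contiguous chunks when it is their turn. Here is a minimal sketch of that idea, not what I actually ran. It assumes a Unix-like system (it uses the standard-library `fcntl` module), and the names `disk_turn`, `read_in_chunks` and `load_batch`, the lock-file path and the chunk size are all made up for illustration; none of this comes from PyTables or pylearn2.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def disk_turn(lock_path):
    """Block until this process holds an exclusive lock on lock_path.

    Every job that agrees to use the same lock file will do its
    disk-heavy work one at a time. The lock is advisory, so it only
    helps if all the jobs cooperate. (Assumption: Unix-only, via fcntl.)
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # waits for the other jobs
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_in_chunks(path, chunk_bytes=64 * 1024 * 1024):
    """Yield large contiguous chunks of a file, front to back."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_bytes)
            if not chunk:
                break
            yield chunk


def load_batch(path, lock_path):
    """Read a whole file while holding the shared disk lock."""
    with disk_turn(lock_path):
        return b"".join(read_in_chunks(path))
```

Each job would call something like `load_batch("my_data.bin", "/tmp/disk.lock")` (both paths hypothetical) whenever it needs a fresh slab of data, then release the lock and go back to computing. Whether this actually helps depends on the disk and on how much of the job is I/O, so time it before trusting it.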


We exist. We google. We bang heads.

I started this blog (originally called 'undergraduatebioinformaticist', hence the 'we exist' in the tagline above) because in my years of UROP-ing at a bioinformatics facility at MIT, I realized just how hard it is to glean free knowledge about this rather caffeinated field. I put hard effort into deciphering what I know from online forums and patchy documentation. I'm jolly well going to share it.

Post-MIT, I had a year-long stint at a tech company in Silicon Valley, and now I'm pursuing a PhD at Stanford.