Friday, January 30, 2015

Stratified sampling code + av_scripts repo

Here's an issue I've encountered before, so I think it's worth documenting: when splitting into training/testing/validation sets, if your data is heterogeneous it's important to keep the proportions of the various classes roughly the same across the splits, or you can get frustrating results. A random split often doesn't achieve this.
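(For the simple case where the thing you're stratifying on is the label you're predicting, scikit-learn can do this for you; more on the differences below. A minimal sketch, assuming a scikit-learn recent enough that train_test_split has a stratify argument, with made-up toy data:)

import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 1000 examples, 10 features, imbalanced binary labels (9:1)
X = np.random.randn(1000, 10)
y = np.array([0] * 900 + [1] * 100)

# stratify=y keeps the 9:1 label ratio roughly intact in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)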

I've written code to do this; it's in my av_scripts repo and it's called splitIntoTrainingTestAndValidationSets.py. You specify the columns representing the classes and the proportions of the train/test/validation files; it will produce files containing the splits in the same directory as your input file.

Setup instructions
These instructions are also in the av_scripts repo, but I am reproducing them for convenience:
$ git clone https://github.com/kundajelab/av_scripts.git
$ export UTIL_SCRIPTS_DIR=/path/to/av_scripts
$ export PATH=$PATH:$UTIL_SCRIPTS_DIR/exec
(see the repo for more details)

Example
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.075 0.075

produces split_train_allLabels_200K.tsv, split_valid_allLabels_200K.tsv and split_test_allLabels_200K.tsv, with the proportions of the values from columns 1 and 2 kept roughly even across those files.

If you don't have a validation split, you can do this:
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.15 --splitNames train test

This produces split_train_allLabels_200K.tsv and split_test_allLabels_200K.tsv. You can also view the full help with splitIntoTrainingTestAndValidationSets.py --help

Differences compared to the scikit-learn implementation
Scikit-learn can only do the split by the explicit class variable; sometimes, the class is not the thing you're predicting, so it isn't part of either your training features or your labels, but it still affects performance. In my case, the prediction task was enhancer/not-enhancer, but I still wanted an even representation of cell types in my positive set. Another issue with the scikit-learn implementation is that it requires reading the data into memory. If your data doesn't fit into memory, my code reads chunks of size --batchSize and does the split batch by batch. One perk of this is that if your data is sorted by some score, setting --batchSize small enough gives you a roughly even representation of scores across the splits.
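To make the batch-by-batch idea concrete, here's a rough sketch of the approach (this is not the actual script; the function names, file names and batch size are made up, and it stratifies on a single category column for simplicity). It streams the file in fixed-size chunks and splits each chunk per category value, so the whole file never has to sit in memory:

import random
from collections import defaultdict

def split_batch(lines, category_column, proportions, output_files):
    # shuffle each category's lines and dole them out to the output
    # files according to the requested proportions
    by_category = defaultdict(list)
    for line in lines:
        by_category[line.rstrip("\n").split("\t")[category_column]].append(line)
    for category_lines in by_category.values():
        random.shuffle(category_lines)
        start = 0
        for i, proportion in enumerate(proportions):
            # the last split takes the remainder, to avoid rounding drift
            end = (len(category_lines) if i == len(proportions) - 1
                   else start + int(round(proportion * len(category_lines))))
            output_files[i].writelines(category_lines[start:end])
            start = end

def split_file(input_path, category_column, proportions, output_paths, batch_size=10000):
    # read input_path batch_size lines at a time and split each batch
    output_files = [open(path, "w") for path in output_paths]
    batch = []
    with open(input_path) as input_file:
        for line in input_file:
            batch.append(line)
            if len(batch) == batch_size:
                split_batch(batch, category_column, proportions, output_files)
                batch = []
        if batch:
            split_batch(batch, category_column, proportions, output_files)
    for output_file in output_files:
        output_file.close()

# hypothetical usage, mirroring the example above:
# split_file("allLabels_200K.tsv", category_column=1, proportions=(0.85, 0.075, 0.075),
#            output_paths=["train.tsv", "valid.tsv", "test.tsv"], batch_size=5000)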



Thursday, January 29, 2015

Inline for loops in bash...

...will make you happy. Here's a code snippet:

$ for i in a b c; do echo "file"$i".txt"; done
filea.txt
fileb.txt
filec.txt

Monday, January 26, 2015

Remember to sort when using bedtools groupby!

Learned this the hard way. At the very end of the bedtools groupby help message, in the notes section, it says: "The input file/stream should be sorted/grouped by the -grp. columns". AUGH. Note that this is absolutely NOT a constraint posed by SQL's group by, even if their bold claim of 'Akin to the SQL "group by" command' may lead you to believe otherwise. Talk about false advertising!
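(A guess as to why the constraint exists: bedtools groupby appears to stream through the file once and merge only adjacent rows with the same key, whereas SQL's group by typically hashes on the key and doesn't care about row order. A toy Python sketch of the difference - just an illustration, nothing to do with the bedtools internals:)

from itertools import groupby
from collections import defaultdict

rows = [("geneA", 1), ("geneB", 2), ("geneA", 3)]  # note: NOT sorted by key

# streaming group-by: only adjacent rows get merged, so geneA appears twice
streamed = [(key, sum(value for _, value in group))
            for key, group in groupby(rows, key=lambda row: row[0])]
# -> [('geneA', 1), ('geneB', 2), ('geneA', 3)]

# hash-based group-by (what SQL gives you): row order doesn't matter
hashed = defaultdict(int)
for key, value in rows:
    hashed[key] += value
# -> {'geneA': 4, 'geneB': 2}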

Sunday, January 25, 2015

Excel is surprisingly powerful

First off, if you didn't know about Excel formulas and column filters, learn about them NOW. I'll update this incrementally as I use Excel features; I don't want to spend too much time writing this post.

Formulas
Exhibit A: =IF(AND((F2 > 0.09 + (0.68/1.3)*D2), (F2 > 0.025 + 0.8*D2)),1,0)

(TIL about a little summary bar at the bottom of your Excel window which tells you the sum of the selected cells. If you set it up with an if statement like the one above, which outputs 1 if true and 0 if false, you can use this real-time-updating info bar to rapidly figure out how many of your selected rows satisfy a condition. Isn't that nifty?)




String concat
It can be done with an &; useful if, for instance, you need to turn "chr", "start", "end" into "chr:start-end" - but if you're doing too much of this you should probably learn inline perl (or the Python equivalent; see the sketch below). But this is handy if you already have Excel open for some reason.
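(If you do go the programmatic route, the chr/start/end case is only a couple of lines - here's a hypothetical Python sketch rather than perl, reading tab-separated columns from stdin:)

import sys

# print chrom:start-end from the first three tab-separated columns of each line
for line in sys.stdin:
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    print("%s:%s-%s" % (chrom, start, end))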

VLookup
I've usually avoided having to use this, but it can be used to do the equivalent of a join - though the more intuitive route is often to sort the two tables by a common key and line them up side by side. And if you're spending too much time on this you should, again, probably do it programmatically. It's just useful to remember.

Conditional formatting
This is really what I wanted to share in this post today. Don't waste time dumping your stuff into a separate heatmap application when simple conditional formatting will suffice:


(sorry, R/G colorblind folks...)

Things you should be wary of: switching to on-disk strategies so you can fit multiple jobs in memory...

I recently made it possible to use pytables with my pylearn2 code, and I was excited because, in addition to allowing me to work with larger datasets, I thought it would also allow me to kick off a large number of jobs with medium-sized datasets in parallel. So, I had two such jobs running on the server, and then, unrelatedly, I tried doing some line-by-line processing of a large file...and noticed that it was taking FOREVER, and that the bottleneck seemed to be I/O. To my chagrin, I'd forgotten that when you switch to I/O-heavy jobs, parallel processing becomes limited by the number of read-write heads you have...and according to the internets, most hard disks can only have one active head at a time. What's worse, if you're parallelising, you're requiring that head to bounce around between very different regions of the disk, incurring significant overhead. It makes me cringe just thinking about it. [stupid stupid stupid]

Friday, January 23, 2015

How the different parts of pylearn2 link together

If you've had the fortune of implementing neural networks in pylearn2, you've probably had to wrangle with the source code a few times. I began today trying to understand how the 'Dataset' class in pylearn2, which stores both the features and the targets, ever communicates which things to use as features and which to use as targets. Trying to answer this took me on a quick tour of all the major aspects of the code base, so I'm documenting what I learned before I forget it. As a preface, you should understand the basics of data_specs (tldr: data_specs is a tuple of spaces and sources; spaces are objects that describe the structure of the numpy arrays and theano batches, while sources are string labels). Also, while wandering through this code base, it generally helps to remember the distinction between regular python functions and symbolic theano expressions; for instance, 'expr' in the Cost objects looks like you can call it as a regular function, but it's actually supposed to be used to compile a theano expression. Same goes for 'fprop' in the MLP object.
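For a concrete (if simplified) picture of what a data_specs tuple looks like, here is a sketch from memory - the dimensions are made up, but CompositeSpace and VectorSpace are real pylearn2 classes, and 'features'/'targets' are the default DenseDesignMatrix sources mentioned below:

from pylearn2.space import CompositeSpace, VectorSpace

# spaces describe the structure of the data (here: 100-d inputs, 10-d targets);
# sources are just string labels naming what each space corresponds to
spaces = CompositeSpace([VectorSpace(dim=100), VectorSpace(dim=10)])
sources = ('features', 'targets')
data_specs = (spaces, sources)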

1) A Dataset, when queried, should be able to specify a tuple of data spaces (which are objects that define the structure of the data; at runtime the data is numpy arrays, and during compilation the data is a symbolic theano batch) and data sources (which are string labels); these are bundled together in a tuple called the data specs.

2) A model, when queried, should also be able to provide a set of input spaces (again, an object defining the structure of the data) and input sources (which are strings). If supervised, it also provides a target space and source.

3) The input space of the model does NOT have to be the same as the space provided by a Dataset, but the string specified in 'sources' should be. *This* is what is used to match up the sources specified by the dataset with the sources required by the model. The matching up occurs in FiniteDatasetIterator; more on this later. Note that if you use a DenseDesignMatrix, your data sources will be 'features' and 'targets', which lines up with what is specified by default for the MLP model.

4) Come train time (I think I saw this in the train.py script), you can either provide a training algorithm or use the training function that comes with a model (the latter should only be done for models that require a very specialised training algorithm; I think I saw a model.train function somewhere). In the MLP case, you're probably using a training algorithm like SGD.

5) This training algorithm requires you to specify a Cost object (which takes an instance of your model), or will use the model's default Cost object if none is provided (there's a get_default_cost method or something like it associated with the model, which in the case of an MLP seems to return an instance of 'Default' from costs/mlp/__init__.py, which in turn uses the theano expression model.cost_from_X).

6) Now look back at model.cost_from_X, which is used by Default in costs/mlp/__init__.py to compile the symbolic cost expression. The signature of model.cost_from_X expects to be provided a variable 'data'. Looking further down in the function, we see 'data' is actually a tuple of X and Y; the X is passed into self.fprop, and the output of the fprop is passed, along with Y, to self.cost (which in the case of the MLP calls the cost function associated with the final layer; for a sigmoidal output, this cost is the mean KL divergence).

7) When the Cost object is asked for its data specs, it likely provides them by inheriting from DefaultDataSpecsMixin. Check out what DefaultDataSpecsMixin does! If self.supervised is true, it returns, for the space: CompositeSpace([model.get_input_space(), model.get_target_space()]). And for the source: (model.get_input_source(), model.get_target_source()). This lines up with the X,Y tuple that the variable 'data' was supposed to consist of in model.cost_from_X.

8) Going back to your SGD training algorithm: it will ask its Cost object what data_specs it needs. The Cost object will return the tuple from DefaultDataSpecsMixin, which as mentioned above is a tuple of the (input_space, target_space) specified by the model. Come iteration time, these data_specs are passed to dataset.iterator, which in the case of the DenseDesignMatrix returns an instance of FiniteDatasetIterator from utils/iteration.py.

9) Finally, we close the loop (no pun intended). FiniteDatasetIterator has access both to the data_specs of the Dataset object and the data_specs that were demanded by the Cost object. To line the two up, it literally finds the indices of the string matches between the sources specified by the Dataset object and the sources specified by the Cost object (no kidding: "idx = dataset_source.index(so)", where so is the source from the cost object). If you don't provide FiniteDatasetIterator with a "convert" array (which specifies a series of functions to convert between the format of the numpy arrays in Dataset and the numpy arrays needed by Cost), the FiniteDatasetIterator will just rely on dspace.np_format_as(batch, sp), where dspace is a Space from the Dataset object and sp is a Space from the Cost object. (There's a toy sketch of this matching logic right after this list.)

10) I still need to hammer out exactly how the nesting/flattening of the data source tuples is taken care of (this is important because a CompositeSpace can be built of other CompositeSpace objects), but it looks like that is handled in DataSpecsMapping from utils/data_specs.py. Anyway, this should at least be useful in remembering how the different parts of pylearn2 link up together.
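To summarize the string-matching in step 9 with a toy sketch (plain Python that paraphrases the FiniteDatasetIterator logic; the 'space' and 'batch' values are just placeholder strings standing in for real pylearn2 Space objects and numpy batches):

# what the Dataset advertises
dataset_sources = ('features', 'targets')
dataset_spaces = ('space of X', 'space of Y')      # placeholders
dataset_batches = ('X batch', 'Y batch')           # placeholders

# what the Cost demanded (via DefaultDataSpecsMixin -> the model's input/target specs)
cost_sources = ('features', 'targets')
cost_spaces = ('model input space', 'model target space')

# the iterator lines the two up purely by matching the source strings, then
# (absent a 'convert' function) formats each batch from the dataset's space
# into the space the cost asked for, via dspace.np_format_as(batch, sp)
for so, sp in zip(cost_sources, cost_spaces):
    idx = dataset_sources.index(so)                # the literal string match
    dspace, batch = dataset_spaces[idx], dataset_batches[idx]
    print("%s: convert %s from '%s' to '%s'" % (so, batch, dspace, sp))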

Monday, January 19, 2015

Upgrade to bleeding-edge version solved issue with Theano and Anaconda

I recently switched to using the python that comes with Anaconda (sidebar: is an Anaconda a kind of python? I would google it except, for the first time, I have the problem where googling search terms gets me the technical usage instead of the layperson one). Theano started hiccupping quite badly, both on my local Mac and on the linux server (I got different errors depending on the machine). In both cases, switching to the bleeding-edge version of theano (http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions) made the problem go away. Googling the error on linux didn't turn up any solutions, so I am copying it here so the search engine fairy will index it:


'Compilation failed (return status=1): /usr/bin/ld: cannot find -lf77blas. collect2: ld returned 1 exit status.'