Friday, January 30, 2015

Stratified sampling code + av_scripts repo

Here's an issue I've encountered before, so I think it's worth documenting: when splitting data into training/testing/validation sets, if your data is heterogeneous it's important to keep the proportions of the various classes roughly equal across the splits, or you can get frustrating results. A purely random split often doesn't achieve this.

I've written code to do this; it's in my av_scripts repo and it's called splitIntoTrainingTestAndValidationSets.py. You specify the columns representing the classes and the proportions of the train/test/validation files; it will produce three files in the same directory as your input file with the splits.

Setup instructions
These instructions are also present here but I am reproducing them for convenience:
$ git clone https://github.com/kundajelab/av_scripts.git
$ export UTIL_SCRIPTS_DIR=/path/to/av_scripts
$ export PATH=$PATH:$UTIL_SCRIPTS_DIR/exec
(see the linked page for more details)

Example
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.075 0.075

produces split_train_allLabels_200K.tsv, split_valid_allLabels_200K.tsv and split_test_allLabels_200K.tsv, with the proportions of the values from columns 1 and 2 roughly even across those files.
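To make "roughly even" concrete, here is a minimal in-memory sketch of the general idea behind a stratified split. It is not the code from av_scripts: the function name stratified_split, the 0-based column indices and the rounding rule are all assumptions made for illustration. Rows are grouped by the combination of values in the category columns, each group is shuffled, and each group is then cut up according to the requested proportions.

```python
import random
from collections import defaultdict


def stratified_split(rows, category_columns, proportions, seed=0):
    """Split rows so that every combination of category values is spread
    across the splits in roughly the requested proportions.

    rows             : list of sequences (e.g. the fields of a TSV line)
    category_columns : 0-based column indices (an assumption of this sketch)
    proportions      : e.g. [0.85, 0.075, 0.075]
    """
    rng = random.Random(seed)

    # Group rows by the tuple of values found in the category columns.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in category_columns)
        groups[key].append(row)

    splits = [[] for _ in proportions]
    total = float(sum(proportions))
    for members in groups.values():
        rng.shuffle(members)
        start, cumulative = 0, 0.0
        for i, p in enumerate(proportions):
            cumulative += p
            is_last = i == len(proportions) - 1
            # The last split takes whatever is left, so no row is dropped.
            end = len(members) if is_last else round(len(members) * cumulative / total)
            splits[i].extend(members[start:end])
            start = end
    return splits


# Small in-memory demo: 80 rows of class "A" and 20 rows of class "B".
demo_rows = [(str(i), "A" if i < 80 else "B") for i in range(100)]
train, valid, test = stratified_split(demo_rows, [1], [0.85, 0.075, 0.075])
print(len(train), len(valid), len(test))
```

With very small groups the rounding decides where individual rows land, so the proportions are only approximate for rare category combinations.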

If you don't have a validation split, you can do this:
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.15 --splitNames train test

to get split_train_allLabels_200K.tsv and split_test_allLabels_200K.tsv. You can also view help with splitIntoTrainingTestAndValidationSets.py --help

Differences compared to the scikit-learn implementation
scikit-learn can only do the split by the explicit class variable; sometimes the class that matters is not what you're predicting, so it isn't part of either your training data or your labels, but it still affects performance. To take my example, my prediction task was enhancer/not-enhancer, but I still wanted an even representation of cell-types in my positive set.

Another issue with the scikit-learn implementation is that it requires reading the data into memory. If your data doesn't fit into memory, my code will do the split by reading chunks of size --batchSize and doing the split batch by batch. One perk of this is that if your data is sorted by some score, setting --batchSize small enough gives you a roughly even representation of scores across the splits.
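The batch-by-batch idea can be sketched as follows. Again, this is only an illustration under stated assumptions, not the av_scripts implementation: the function name split_in_batches, the tab-separated input, the 0-based column indices and the rounding rule are all made up for the example. Only one batch of lines is held in memory at a time, and because every batch is split separately, rows from every region of a sorted file end up in every split.

```python
import itertools
import random
from collections import defaultdict


def split_in_batches(lines, category_columns, proportions, batch_size, seed=0):
    """Yield (split_index, line) pairs, stratifying within each batch.

    lines can be any iterable of tab-separated lines, e.g. an open file,
    so at most batch_size lines are held in memory at once.
    """
    rng = random.Random(seed)
    total = float(sum(proportions))
    lines = iter(lines)
    while True:
        batch = list(itertools.islice(lines, batch_size))
        if not batch:
            break
        # Group the lines of this batch by their category values.
        groups = defaultdict(list)
        for line in batch:
            fields = line.rstrip("\n").split("\t")
            groups[tuple(fields[c] for c in category_columns)].append(line)
        # Shuffle each group and cut it up according to the proportions.
        for members in groups.values():
            rng.shuffle(members)
            start, cumulative = 0, 0.0
            for i, p in enumerate(proportions):
                cumulative += p
                is_last = i == len(proportions) - 1
                end = len(members) if is_last else round(len(members) * cumulative / total)
                for line in members[start:end]:
                    yield i, line
                start = end


# In-memory demo: 1000 lines, alternating between two categories.
demo_lines = ["row%d\t%s\n" % (i, "A" if i % 2 else "B") for i in range(1000)]
counts = defaultdict(int)
for split_index, _ in split_in_batches(demo_lines, [1], [0.8, 0.2], batch_size=100):
    counts[split_index] += 1
print(dict(counts))
```

In real use you would write each yielded line to the output file for its split instead of counting. One trade-off of this sketch: the smaller the batch, the fewer rows each category has per batch, so rounding has more influence on the final proportions.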


