Sunday, December 6, 2015

fatal error: 'cblas.h' file not found

I once again found myself installing caffe on my Mac, and ran into that error. The solution was to install openblas via "brew install openblas", and then edit the relevant lines in Makefile.config to look like this:

BLAS := open
BLAS_INCLUDE := /usr/local/Cellar/openblas/0.2.14_1/include
BLAS_LIB := /usr/local/Cellar/openblas/0.2.14_1/lib

Also, when using brew install I was getting some weird permission errors about being unable to do the linking. I tried "sudo brew link openblas" and then got a confusing error about not running brew with sudo. The solution was to execute the following lines:

sudo chown -R `whoami`:admin /usr/local/bin

sudo chown -R `whoami`:admin /usr/local/share

After that I was able to brew link as needed.

Contents of my Makefile.config, largely for my own reference: http://pastebin.com/RscLA8j9 (I think the only other things I modified were the python paths for anaconda).

Thursday, December 3, 2015

How to parse a caffe deploy.prototxt file or solver.prototxt file using python.


1) Install google's protobuf by downloading the source from https://developers.google.com/protocol-buffers/docs/downloads (I had protobuf-2.6.1.tar.gz), untarring (tar -zxvf [filename]), and following the instructions in the README

2) Get your caffe .proto file. It lives in [caffe folder]/src/caffe/proto/caffe.proto

3) Compile the caffe.proto file with protoc --python_out=. caffe.proto; it will produce caffe_pb2.py in the same directory.

4.1) For parsing a deploy.prototxt file, use:

import caffe_pb2 #you created this module with the protoc command
from google.protobuf.text_format import Merge
net = caffe_pb2.NetParameter()
Merge((open("deploy.prototxt",'r').read()), net)

4.2) For parsing a solver.prototxt file, use:

import caffe_pb2 #you created this module with the protoc command
from google.protobuf.text_format import Merge
solver = caffe_pb2.SolverParameter()
Merge((open("solver.prototxt",'r').read()), solver)

Remember, you can inspect the attributes of the object using dir(net) or dir(solver)
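For example, once Merge has populated the messages, the fields defined in caffe.proto are available as plain attributes (the field names below come from caffe.proto and can vary between Caffe versions, so treat this as a sketch):

print(net.name)                 # network name, if set
print(len(net.layer))           # number of layers (older caffe.proto versions use net.layers)
print(solver.base_lr, solver.max_iter)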

Tuesday, August 18, 2015

Yaml is really slow in vim

Apparently this is a known issue.

Here I'm documenting some specifics of the fix that weren't clear from the Stack Overflow answer. I'm not totally sure I did the "right" thing, but it works.

1) Download https://raw.githubusercontent.com/stephpy/vim-yaml/master/after/syntax/yaml.vim to ~/.vim
2) Put "au BufNewFile,BufRead *.yaml,*.yml so ~/.vim/yaml.vim" at the top of ~/.vimrc

You should be good to go.

Recursive find and replace

This was way too hard to figure out. I gave up on using sed because I was getting "invalid command code ." and "illegal byte sequence" errors, due to the -i option being interpreted differently on OS X, and I ended up corrupting my git repository.

find . -type f -print0 | xargs -0 perl -i -pe '$_ =~ s/progressUpdates/progressUpdate/g' 
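If you'd rather avoid the shell quoting entirely, here's a rough Python 3 alternative (a sketch, not a drop-in replacement: it skips .git but otherwise naively assumes every file is text):

import pathlib

for path in pathlib.Path(".").rglob("*"):
    if ".git" in path.parts or not path.is_file():
        continue  # don't touch git internals (see the corrupted-repo warning above)
    text = path.read_text(errors="ignore")
    new_text = text.replace("progressUpdates", "progressUpdate")
    if new_text != text:
        path.write_text(new_text)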

Tuesday, July 14, 2015

Numpy str on integers extremely slow?

I found a weird performance difference today: create a numpy array of integers, and then convert it to a list of strings with a list comprehension. It's weirdly slow. Cast the numpy integers to Python ints before calling str(), and you get nearly a 10x speedup. Observe:


$ python -m timeit 'import numpy as np; newX = [x for x in np.arange(4096)]; [str(x) for x in newX]'
100 loops, best of 3: 10.5 msec per loop
$ python -m timeit 'import numpy as np; newX = [x for x in np.arange(4096)]; [str(int(x)) for x in newX]'
1000 loops, best of 3: 1.23 msec per loop
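Two other ways to sidestep calling str() on numpy integer scalars, in case they're useful (a sketch; exact timings will obviously vary):

import numpy as np

arr = np.arange(4096)
strings_1 = [str(x) for x in arr.tolist()]  # tolist() yields plain Python ints
strings_2 = arr.astype(str)                 # vectorized conversion to a numpy string array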

Monday, July 13, 2015

Line by line python performance profiling (in particular, about where that @profile decorator comes from)

Here's the thing I didn't understand about rkern's line-by-line performance profiler (line_profiler): the kernprof executable injects the @profile decorator into your script's builtins, so no import is necessary. When you run your script without kernprof, you will get a NameError complaining that 'profile' is not defined. But if you run kernprof -l [original python call], it works. Happy profiling!
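To make that concrete, here's a minimal sketch (the function and file name are made up); it only runs under kernprof:

@profile  # injected by kernprof at runtime; plain "python profile_demo.py" raises a NameError
def slow_function(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    slow_function(100000)

Run it with "kernprof -l profile_demo.py", then view the per-line timings with "python -m line_profiler profile_demo.py.lprof".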

Tuesday, May 26, 2015

Caffe incompatible with Boost 1.58.0

Boost 1.58.0 was released on April 17th, and it breaks Caffe compilation on OSX. When the build reaches one of the "NVCC" steps, you get a string of errors, starting with complaints about a missing semicolon, that look like this: /usr/local/include/boost/smart_ptr/detail/sp_counted_base_clang.hpp(27): error: expected a ";"

Googling landed me on this page, which suggested an incompatibility with boost 1.58.0 (https://groups.google.com/forum/#!topic/caffe-users/fY2r6bO3_0w). However, the steps suggested in the solution (as of May 26th; it seems to have been updated since) didn't work for me. So this is what I did; it's kind of sketchy, but oh well...

[July 16th update: Ian Blenke put together a boost-1.57 brew and mentioned how to use it in the comments; it looks a lot easier and less sketchy to install, but if that does not work for you, read on...]

1) cd $(brew --prefix) (for me this cd's to /usr/local/).
2) cd Library/Formula
3) Make a backup of the files boost.rb and boost-python.rb
4) Replace the boost.rb file with the contents of https://raw.githubusercontent.com/Homebrew/homebrew/6fd6a9b6b2f56139a44dd689d30b7168ac13effb/Library/Formula/boost.rb
5) Replace the boost-python.rb file with the contents of https://raw.githubusercontent.com/Homebrew/homebrew/3141234b3473717e87f3958d4916fe0ada0baba9/Library/Formula/boost-python.rb

6) brew uninstall boost (this should uninstall the 1.58.0 version)
7) brew install boost (this should install 1.57.0 since you've replaced the rb files)

(Edit: You *may* have to do "brew link boost" after this - can't recall exactly. Also, Parang Saraf pointed out a "no rule to make target" error in the comments, which they solved with "make clean")

Also, I missed this guide the first time I went through the page, so I'm linking to it in case you did too: http://caffe.berkeleyvision.org/install_osx.html

Tuesday, April 7, 2015

Adventures installing shogun and shogun's python modular interface

First of all, whatever instructions they have on their main website seem to be out of date, because there is no "configure" script, so ./configure gets you nowhere.

Second, I think there used to be a file called "INSTALL", but there isn't anymore, and the README is utterly unhelpful. This is all I could find, and following it seemed to work. That GitHub repo also seems to be the most up to date in terms of documentation.

If ccmake is installed, use it - it will give you insight into all the available options. If not, well, you need to run "cmake .. -DCmdLineStatic=ON -DPythonModular=ON" to get both the static command line interfaces and the python modular interface (I even tossed in a -DPythonStatic=ON, although I'm not sure it did anything... I was driving blind due to no ccmake, and for whatever reason none of this is easily accessible online!). Also, my colleague pointed me to this quickstart, also on GitHub.

I had to install swig for the python modular interface. IMPORTANT: with swig 3.0.2, the python modular interface gave me a segfault, but downgrading to 2.0.12 worked. I've done this in the past without root access, but that time I had ccmake and could easily tell shogun where to find swig - this time I didn't have ccmake, so I couldn't figure out what to specify (seriously, this is awful documentation even by open source standards!). If you can't get root access, playing around with CMakeCache.txt in the build directory might be a start (note that both the build directory and the main directory have a CMakeCache.txt; you should modify the one in the build directory).

Finally, shogun installed completely error-free, but "import shogun" was not working. That's because shogun had installed to /usr/local/lib/python2.7/site-packages/, which wasn't even in my actual python's path, let alone anaconda's. To get it to work with anaconda, I copied the folder shogun and the files modshogun.py and _modshogun.so to /path/to/anaconda/lib/python2.7/site-packages.

To diagnose the issue with your particular installation of python, start the python shell, do "import sys", and then "print sys.path" - if the place where shogun decided to put the python libraries (displayed on the screen when you do "make install", and also present in CMakeCache.txt in the build folder) is not in that list, well, that's the problem. By hook or by crook, get it onto that list. One thing you can do is go back and recompile shogun with PYTHON_PACKAGES_PATH set to the right value - as I don't have ccmake, the only way I can think of to set this is to open up CMakeCache.txt in your build folder and modify it there. I also hear something funky with the "PYTHONPATH" environment variable is supposed to work, but for whatever reason it never does for me. As a last resort, you can do sys.path.insert(0, "/path/to/shogun's/python/packages/folder/").
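Putting that together, a minimal sketch of the diagnosis (the site-packages path here is just the one from my install; substitute whatever "make install" reported):

import sys
print(sys.path)  # is shogun's install location in here?

shogun_site = "/usr/local/lib/python2.7/site-packages"
if shogun_site not in sys.path:
    sys.path.insert(0, shogun_site)  # the last-resort fix described above

import modshogun  # should now work (barring the libshogun.so issue below)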

I was then confronted with a whiny "ImportError: libshogun.so.17: cannot open shared object file: No such file or directory". I looked at the messages when shogun was installed and confirmed that libshogun.so.17 did in fact live in /usr/local/lib. Turns out I had to do "sudo ldconfig" to get that to behave. NO ONE MENTIONED THAT ANYWHERE.

Monday, April 6, 2015

Error: 'SecTrustEvaluateAsync' undeclared

I got this error when trying to compile cmake on my mac. It apparently concerns curl in some way. It was fixed by commenting out some of the lines from 2003 to 2048 in Utilities/cmcurl/lib/vtls/curl_darwinssl.c - but I'm not sure whether to take the first or the second block of the if statement (for now I'm taking the first block).

Adventures installing gkmersvm

gkmersvm is an svm kernel implementation from the Beer lab: http://www.beerlab.org/gkmsvm/

I ran into a couple of issues when trying to compile it, and I thought I'd describe them here.

When I first typed in make, I got: fatal error: 'tr1/unordered_map' file not found

This was fixed by upgrading my gcc using the instructions on: http://stackoverflow.com/questions/837992/update-gcc-on-osx

Then make resulted in: "'getopt' was not declared in this scope"

This was fixed by adding "#include <getopt.h>" to src/mainSVMclassify.cpp:

When I did this on our linux server, I also got a bunch of errors saying "intptr_t" has not been declared. This was fixed by adding #include <stdint.h> to the following files:

src/CountKLmersGeneral.cpp:19
src/MLEstimKLmers.cpp
src/LKTree.cpp
src/LTree.cpp
src/LKTree.h 

After this, "make" produced the necessary binaries gkmsvm_classify, gkmsvm_kernel and gkmsvm_train. Curiously enough, there's no "make install", so I manually copied the binaries to /usr/local/bin.

Finally, when I tried to run the modshogun script, I got the error "Cannot create new instances of type 'Labels'". I think this is because I was using shogun 4.0.0, and the script was written for Shogun version 0.9.2 ~ 1.1.0. Changing every instance of "Labels" to "BinaryLabels" fixed the issue. Note, if you are getting a segfault with the modular interface, or are having trouble installing it, see my post about shogun.

Also...after that gcc upgrade, I ran into the following error with a line containing "dispatch_block_t", which once again requires modifying some source code: http://hamelot.co.uk/programming/osx-gcc-dispatch_block_t-has-not-been-declared-invalid-typedef/


Saturday, April 4, 2015

Copying files via the terminal gets around "read only" error from USB in Ubuntu

This was very weird. I was recently trying to get files off an old laptop that had suffered a fall. The internet connection was not working. Bluetooth was not working. And it looked like an old USB drive I found, which I erased and reformatted multiple times, wasn't working either: when I tried pasting files into the drive via the GUI, I got a "read only" error (yes, I know how to take screenshots, but remember that whether or not I could get files off this computer was an open question).


And yet, copying into /media/[drive name] via the terminal worked, although it was slow.


[shakes head]

Tuesday, March 10, 2015

Course waiver info for Stanford courses

The C.S. PhD program has a course waiver process, and given that so many of the students come from similar institutions, there should really be a pre-approved list of courses to transfer over for every area, like there is for "Software Systems". But no, I had to fill out the course waiver form in its entirety, and didn't want to risk annoying the approver by leaving fields blank because I assumed they were familiar with the course. But the least I can do is dump the info here for the two courses I did apply for a waiver for (CS 173 and CS 161, using 6.047 and 6.046 from MIT respectively)

General Info

Stanford course to be waived: CS 173

Advisor name: Gill Bejerano

Non-Stanford Course Info

University Name: Massachusetts Institute of Technology

Instructor: Manolis Kellis

Year: 2012

Description:

Covers the algorithmic and machine learning foundations of computational biology, combining theory with practice. Principles of algorithm design, influential problems and techniques, and analysis of large-scale biological datasets. Topics include (a) genomes: sequence analysis, gene finding, RNA folding, genome alignment and assembly, database search; (b) networks: gene expression analysis, regulatory motifs, biological network analysis; (c) evolution: comparative genomics, phylogenetics, genome duplication, genome rearrangements, evolutionary theory. These are coupled with fundamental algorithmic techniques including: dynamic programming, hashing, Gibbs sampling, expectation maximization, hidden Markov models, stochastic context-free grammars, graph clustering, dimensionality reduction, Bayesian networks.

Syllabus:

Dynamic Programming, Global and local alignment
Database search, Rapid string matching, BLAST, BLOSUM
Multiple, Progressive, Phylogenetic, Whole-genome alignment
Whole-Genome Comparative genomics: Evolutionary signatures, Duplication
Genome Assembly: Consensus-alignment-overlap, Graph-based assembly
Hidden Markov Models Part 1: Evaluation / Parsing, Viterbi/Forward algorithm
Hidden Markov Models Part 2: Posterior Decoding / Learning, Baum-Welch
Structural RNAs: Fold prediction, genome-wide annotation
Transcript structure analysis, Differential Expression, Significance Testing
Small and Large regulatory RNAs: lincRNA, miRNA, piRNA
Expression analysis: Clustering and Classification, K-means, Naïve-Bayes
Clustering: affinity propagation; Classification: Random Forests
Epigenomics: ChIP-Seq, Burrows-Wheeler alignment, Chromatin States
Regulatory Motifs: Discovery, Representation, PBMs, Gibbs Sampling, EM
Expression deconvolution functional data of mixed samples
Network Inference: Physical and functional networks, information integration
The ENCODE Project: Systematic experimentation and integrative genomics
Dimensionality reduction, PCA, Self-Organizing Maps, SVMs
Disease association mapping, GWAS, organismal phenotypes
Quantitative trait mapping, eQTLs, molecular trait variation
Linkage Disequilibrium, Haplotype phasing, variant imputation
Molecular Evolution, Tree Building, Phylogenetic inference
Gene/species trees, reconciliation, recombination graphs
Mutation rate estimation, coalescent, models of evolution
Missing Heritability, Complex Traits, Interpret GWAS, Rank-based enrichment
Recent human evolution: Human history, human selection
Personal Genomics, Disease Epigenomics: Systems approaches to disease
Three-dimensional chromatin interactions: 3C, 5C, HiC, ChIA-Pet
Pharmacogenomics: Network-based systems biology of drug response
Synthetic Biology: Reading and writing genomes and cellular circuits

+ Final project

Textbook list:

No textbook; Kellis was formulating extensive class notes with the help of the students.

Stanford Course Info

Description:

Introduction to computational biology through an informatic exploration of the human genome. Topics include: genome sequencing; functional landscape of the human genome (genes, gene regulation, repeats, RNA genes, epigenetics); genome evolution (comparative genomics, ultraconservation, co-option). Additional topics may include population genetics, personalized genomics, and ancient DNA. Course includes primers on molecular biology, the UCSC Genome Browser, and text processing languages. Guest lectures on current genomic research topics.

Syllabus:

Introductory biology
Protein Coding Genes
UCSC Genome Browser Tutorial
Introduction to Text Processing Tutorial
Non-protein Coding Genes
Transcriptional Activation I
Transcriptional Regulation II
Transcriptional Regulation III
Genome Evolution I: Repeats
Genome Evolution II
Chains & Nets, Conservation & Function
Sequencing, Human Variation, and Disease
Personal Genomics, GSEA/GREAT
Transcription factor binding sites - Functions and Complexes
Population Genetics & Evo-Devo
Ancestral genome-phenotype mapping

Textbook list:

No specific textbook


General Info

Stanford course to be waived: CS 161

Advisor name: Serge Plotkin

Non-Stanford Course Info

University Name: Massachusetts Institute of Technology

Instructor: Bruce Tidor, Dana Moshkovitz

Year: 2012

Description:

Techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics include sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; greedy algorithms; amortized analysis; graph algorithms; and shortest paths. Advanced topics may include network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.

(This is the second algorithms class in the standard sequence)

Syllabus:

Median Finding
Scheduling
Minimum Spanning Trees
Fast Fourier Transform
All-Pairs Shortest Paths
Randomized algorithms, high probability bounds
Hashing
Amortized Analysis
Competitive Analysis
Network Flow
van Emde Boas Data Structure
Disjoint Sets Data Structures
P vs. NP
Approximation Algorithms
Compression
Sub-linear time algorithms
Clustering
Derandomization
Computational Geometry

Textbook list:

Introduction to Algorithms (CLRS)

Stanford Course Info

Description:

Worst and average case analysis. Recurrences and asymptotics. Efficient algorithms for sorting, searching, and selection. Data structures: binary search trees, heaps, hash tables. Algorithm design techniques: divide-and-conquer, dynamic programming, greedy algorithms, amortized analysis, randomization. Algorithms for fundamental graph problems: minimum-cost spanning tree, connected components, topological sort, and shortest paths.

Syllabus:

Algorithmic complexity and analysis (4 lectures)
Randomization, divide and conquer (2 lectures)
Heaps and counting sort (1 lecture)
Hashing (2 lectures)
Tree and graph definitions and properties (1 lecture)
Binary Search Trees (1 lecture)
Greedy Algorithms (2 lectures)
Dynamic programming (3 lectures)
Graph algorithms (4 lectures)

Textbook list:

CLRS

Wednesday, February 25, 2015

Execute a command for every line in a file in bash

Another very handy bash snippet:

cat nodesToConsider.txt | while read in; do cp ../'layer0_inputGuess_n'$in'.png' .; done

Saturday, February 21, 2015

Diagonal axis labels in R

I'm dumping a relevant code snippet here because I know I will want to come back to it again:
(found here)


> bp = barplot(data$informativeness[1:len], axes=FALSE, axisnames=FALSE)
> text(bp, par("usr")[3], labels = data$feature[1:len], srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.9)
> axis(2)
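(And if you happen to be in matplotlib rather than R, the rough equivalent is rotating the tick labels; a small sketch with made-up data:)

import matplotlib.pyplot as plt

features = ["featA", "featB", "featC"]
informativeness = [0.9, 0.5, 0.2]
plt.bar(range(len(features)), informativeness)
plt.xticks(range(len(features)), features, rotation=45, ha="right")
plt.show()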

Friday, January 30, 2015

Stratified sampling code + av_scripts repo

Here's an issue I've encountered before, so I think it's worth documenting: when splitting data into training/testing/validation sets, if your data is heterogeneous it's important to keep the proportions of the various classes roughly equal across the splits, or you can get frustrating results. A random split often doesn't achieve this.

I've written code to do this; it's in my av_scripts repo and it's called splitIntoTrainingTestAndValidationSets.py. You specify the columns representing the classes and the proportions of the train/test/validation files; it will produce three files in the same directory as your input file with the splits.

Setup instructions
These instructions are also present here but I am reproducing them for convenience:
$ git clone https://github.com/kundajelab/av_scripts.git
$ export UTIL_SCRIPTS_DIR=/path/to/av_scripts
$ export PATH=$PATH:$UTIL_SCRIPTS_DIR/exec
(see the linked page for more details)

Example
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.075 0.075

produces split_train_allLabels_200K.tsv, split_valid_allLabels_200K.tsv and split_test_allLabels_200K.tsv, such that the proportions of the values from columns 1 and 2 are roughly even across those files.

If you don't have a validation split, you can do this:
splitIntoTrainingTestAndValidationSets.py --inputFile allLabels_200K.tsv --categoryColumns 1 2 --splitProportions 0.85 0.15 --splitNames train test

This produces split_train_allLabels_200K.tsv and split_test_allLabels_200K.tsv. You can also view the help with splitIntoTrainingTestAndValidationSets.py --help

Differences compared to the scikit-learn implementation
Scikit-learn can only do the split by the explicit class variable; sometimes, the class is not what you're predicting, so it isn't part of your training data or your labels, but it still affects performance. To take my example, my prediction task was enhancer/not-enhancer, but I still wanted an even representation of cell types in my positive set. Another issue with the scikit-learn implementation is that it requires reading the data into memory. If your data doesn't fit into memory, my code will do the split by reading chunks of size --batchSize and doing the split batch by batch. One perk of this is that if your data is sorted by some score, setting --batchSize small enough gives you a roughly even representation of scores across the splits.
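For reference, the scikit-learn flavor of a stratified split looks roughly like the sketch below (this assumes a recent scikit-learn with the model_selection module; older versions expose similar functionality under cross_validation). It stratifies on a single label array and needs everything in memory, which is exactly the limitation described above:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # features
y = np.random.randint(0, 3, size=1000)  # class labels to balance across splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)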



Thursday, January 29, 2015

Inline for loops in bash...

...will make you happy. Here's a code snippet:

$ for i in a b c; do echo "file"$i".txt"; done
filea.txt
fileb.txt
filec.txt

Monday, January 26, 2015

Remember to sort when using bedtools groupby!

Learned this the hard way. At the very end of the bedtools groupby help message, in the notes section, it says: "The input file/stream should be sorted/grouped by the -grp. columns". AUGH. Note that this is absolutely NOT a constraint posed by SQL's group by, even if their bold claim of 'Akin to the SQL "group by" command' may lead you to believe otherwise. Talk about false advertising!

Sunday, January 25, 2015

Excel is surprisingly powerful

First off, if you didn't know about Excel formulas and column filters, learn about them NOW. I'll update this incrementally as I use an Excel feature; I don't want to spend too much time writing this post.

Formulas
Exhibit A: =IF(AND((F2 > 0.09 + (0.68/1.3)*D2), (F2 > 0.025 + 0.8*D2)),1,0)

(TIL about a little summary bar at the bottom of your excel window which tells you the sum. If you set it up with an if statement like the one above, which outputs 1 if true and 0 if false, you can use this real-time-updating info bar to rapidly figure out how many of your selected rows satisfy a condition. Isn't that nifty?)




String concat
It can be done with an &; useful if, for instance, you need to turn "chr", "start", "end" into "chr:start-end" - but if you're doing too much of this you should probably learn inline perl. Still, it's handy if you already have Excel open for some reason.

VLookup
I've usually avoided having to use this but it can be used to do the equivalent of a join - though the far more intuitive way would be to sort the two tables by a common key. And if you're spending too much time on this you should, again, probably do it programmatically. It's just useful to remember.

Conditional formatting
This is really what I wanted to share on this post today. Don't waste time dumping your stuff into a separate heatmap application when a simple conditional formatting will suffice:


(sorry R/G colorblind folks...)

ooooo.






Things you should be wary of: switching to on-disk strategies so you can fit multiple jobs in memory...

I recently made it possible to use pytables with my pylearn2 code, and I was excited because, in addition to allowing me to work with larger datasets, I thought it would also allow me to kick off a large number of jobs with medium-sized datasets in parallel. So, I had two such jobs running on the server, and then, unrelatedly, I tried doing some line-by-line processing of a large file...and noticed that it was taking FOREVER, and that the bottleneck seemed to be I/O. To my chagrin, I'd forgotten that when you switch to I/O heavy jobs, parallel processing becomes limited by the number of read-write heads you have...and according to the internets, most hard disks can only have one active head at a time. What's worse, if you're parallelising, you're requiring that head to bounce around between very different regions of the disk, incurring significant overhead. It makes me cringe just thinking about it. [stupid stupid stupid]

Friday, January 23, 2015

How the different parts of pylearn2 link together

If you've had the fortune of implementing neural networks in pylearn2, you've probably had to wrangle with the source code a few times. I began today trying to understand how the 'Dataset' class in pylearn2, which stores both the features and the targets, ever communicates which things to use as features and which to use as targets. Trying to answer this took me on a quick tour of all the major aspects of the code base, so I'm documenting what I learned before I forget it. As a preface, you should understand the basics of data_specs (tldr: data_specs is a tuple of spaces and sources; spaces are objects that describe the structure of the numpy arrays and theano batches, while sources are string labels). Also, while wandering through this code base, it generally helps to remember the distinction between regular python functions and symbolic theano expressions; for instance, 'expr' in the Cost objects looks like you can call it as a regular function, but it's actually supposed to be used to compile a theano expression. The same goes for 'fprop' in the MLP object.

1) A Dataset, when queried, should be able to specify a tuple of data spaces (which are objects that define the structure of the data; at runtime the data is numpy arrays, and during compilation the data is a symbolic theano batch) and data sources (which are string labels); these are bundled together in a tuple called the data specs.

2) A model, when queried, should also be able to provide a set of input spaces (again, an object defining the structure of the data) and input sources (which are strings). If supervised, it also provides a target space and source.

3) The input space of the model does NOT have to be the same as the space provided by a Dataset, but the string specified in 'sources' should be. *This* is what is used to match up the sources specified by the dataset with the sources required by the model. The matching up occurs in FiniteDatasetIterator; more on this later. Note that if you use a DenseDesignMatrix, your data sources will be 'features' and 'targets', which lines up with what is specified by default for the MLP model.

4) Come train time (I think I saw this in the train.py script), you can either provide a training algorithm or use the training function that comes with a model (the latter should only be done for models that require a very specialised training algorithm; I think I saw a model.train function somewhere). In the MLP case, you're probably using a training algorithm like SGD.

5) This training algorithm requires you to specify a Cost object (which takes an instance of your model), or will use the model's default Cost object if none is provided (there's a get_default_cost method or something like it associated with the model, which in the case of a MLP seems to return an instance of 'Default' from costs/mlp/__init__.py, which in turn uses the theano expression model.cost_from_X).

6) Now look back at model.cost_from_X, which is used by Default in costs/mlp/__init__.py to compile the symbolic cost expression. The signature of model.cost_from_X expects to be provided a variable 'data'. Looking down further in the function, we see 'data' is actually a tuple of X and Y; the X is passed into self.fprop, and the output of the fprop is passed, along with Y, to self.cost (which in the case of the MLP calls the cost function associated with the final layer; for a sigmoidal output, this cost is the mean KL divergence).

7) When the Cost object is asked for its data specs, it likely provides them by inheriting from DefaultDataSpecsMixin. Check out what DefaultDataSpecsMixin does! If self.supervised is true, it returns, for the space: CompositeSpace([model.get_input_space(), model.get_target_space()]). And for the source: (model.get_input_source(), model.get_target_source()). This lines up with the X,Y tuple that the variable 'data' was supposed to consist of in model.cost_from_X.

8) Going back to your SGD training algorithm. It will ask its Cost function for what data_specs it needs. The Cost object will return the tuple from DefaultDataSpecsMixin, which as mentioned above is a tuple of the (input_space, target_space) specified by the model. Come iteration time, these data_specs are passed to model.iterator, which in the case of the DenseDesignMatrix is returning an instance of FiniteDatasetIterator from utils/iteration.py.

9) Finally, we close the loop (no pun intended). FiniteDatasetIterator has access both to the data_specs of the Dataset object and the data_specs that were demanded by the Cost object. To line the two up, it literally finds the indices of the string matches between the sources specified by the Dataset object and the Sources specified by the cost object (no kidding: "idx = dataset_source.index(so)", where so is the source from the cost object). If you don't provide FiniteDatasetIterator with a "convert" array (which specifies a series of functions to convert between the format of the numpy arrays in Dataset and the numpy arrays needed by Cost), the FiniteDatasetIterator will just rely on dspace.np_format_as(batch, sp), where dspace is a Space from the Dataset object and sp is a Space from the Cost object.

10) I still need to hammer out exactly how the nesting/flattening of the data source tuples is taken care of (this is important because a CompositeSpace can be built of other CompositeSpace objects), but it looks like that is handled in DataSpecsMapping from utils/data_specs.py. Anyway, this should at least be useful in remembering how the different parts of pylearn2 link up together.
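As a concrete reference point, here's roughly what a data_specs tuple looks like for the supervised MLP case described above (a sketch; the dims are made up, and the spaces live in pylearn2.space):

from pylearn2.space import CompositeSpace, VectorSpace

input_space = VectorSpace(dim=784)   # structure of the feature batches
target_space = VectorSpace(dim=10)   # structure of the target batches
data_specs = (CompositeSpace([input_space, target_space]),
              ('features', 'targets'))  # sources, matched up by string in FiniteDatasetIterator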

Monday, January 19, 2015

Upgrade to bleeding edge version solved issue with Theano and Anaconda

I recently switched to using the python that comes with Anaconda (sidebar: is an Anaconda a kind of python? I would google it except, for the first time, I have the problem where googling search terms gets me the technical usage instead of the layperson one). Theano started hiccupping quite badly, both on my local Mac and on the linux server (I got different errors depending on the machine). In both cases, switching to the bleeding-edge version of theano (http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions) made the problem go away. Googling the error on linux didn't turn up any solutions, so I am copying it here so the search engine fairy will index it:


 'Compilation failed (return status=1): /usr/bin/ld: cannot find -lf77blas. collect2: ld returned 1 exit status. '