If you've had the fortune of implementing neural networks in pylearn2, you've probably had to wrangle with the source code a few times. I began today trying to understand how the 'Dataset' class in pylearn2, which stores both the features and the targets, ever communicates which parts of the data to use as features and which to use as targets. Trying to answer this took me on a quick tour of all the major aspects of the code base, so I'm documenting what I learned before I forget it. As a preface, you should understand the basics of data_specs (tldr: data_specs is a tuple of spaces and sources; spaces are objects that describe the structure of the numpy arrays and theano batches, while sources are string labels). Also, while wandering through this code base, it generally helps to remember the distinction between regular python functions and symbolic theano expressions; for instance, 'expr' in the Cost objects looks like you can call it as a regular function, but it actually builds a symbolic theano expression that gets compiled later. The same goes for 'fprop' in the MLP object.
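To make that concrete, here is roughly what a data_specs tuple looks like for a supervised setup with 784-dimensional inputs and 10-dimensional one-hot targets (the dimensions are just for illustration):

```python
from pylearn2.space import CompositeSpace, VectorSpace

# A data_specs tuple is (space, sources): the space describes the structure
# of the data, the sources are string labels naming each piece of data.
data_specs = (
    CompositeSpace([VectorSpace(dim=784), VectorSpace(dim=10)]),  # spaces
    ('features', 'targets'),                                      # sources
)
```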
1) A Dataset, when queried, should be able to specify its data spaces (objects that define the structure of the data; at runtime the data is numpy arrays, and during compilation the data is a symbolic theano batch) and its data sources (which are string labels); these are bundled together in a tuple called the data specs.
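For instance, with a DenseDesignMatrix you can ask the dataset for its specs directly (the toy data here is made up; I'm only interested in the specs):

```python
import numpy as np
from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix

# A toy dataset: 100 examples, 784 features, 10-class one-hot targets.
X = np.random.randn(100, 784).astype('float32')
y = np.eye(10)[np.random.randint(0, 10, size=100)].astype('float32')
dataset = DenseDesignMatrix(X=X, y=y)

# Prints something like:
# (CompositeSpace(VectorSpace(dim=784), VectorSpace(dim=10)), ('features', 'targets'))
print(dataset.get_data_specs())
```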
2) A model, when queried, should also be able to provide an input space (again, an object defining the structure of the data) and an input source (a string). If the task is supervised, it also provides a target space and a target source.
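For an MLP that looks roughly like this (the layer sizes are arbitrary; the commented outputs are what I expect, not verbatim output):

```python
from pylearn2.models.mlp import MLP, Sigmoid, Softmax

model = MLP(nvis=784,
            layers=[Sigmoid(layer_name='h0', dim=500, irange=0.05),
                    Softmax(layer_name='y', n_classes=10, irange=0.05)])

print(model.get_input_space())    # a VectorSpace of dim 784
print(model.get_input_source())   # 'features'
print(model.get_target_space())   # a VectorSpace of dim 10
print(model.get_target_source())  # 'targets'
```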
3) The input space of the model does NOT have to be the same as the space provided by a Dataset, but the string specified in 'sources' should be. *This* is what is used to match up the sources specified by the dataset with the sources required by the model. The matching up occurs in FiniteDatasetIterator; more on this later. Note that if you use a DenseDesignMatrix, your data sources will be 'features' and 'targets', which lines up with what is specified by default for the MLP model.
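To see why mismatched spaces aren't a problem as long as the sources line up: spaces know how to convert batches between formats. A sketch, assuming the usual VectorSpace-to-Conv2DSpace conversion works the way I think it does:

```python
import numpy as np
from pylearn2.space import VectorSpace, Conv2DSpace

# The dataset stores flat 784-dimensional rows...
dataset_space = VectorSpace(dim=784)
# ...but the model might want 28x28 single-channel images.
model_space = Conv2DSpace(shape=[28, 28], num_channels=1)

batch = np.random.randn(5, 784).astype('float32')
# Same data, reformatted into the space the model asked for.
converted = dataset_space.np_format_as(batch, model_space)
print(converted.shape)
```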
4) Come train time (I think I saw this in the train.py script), you can either provide a training algorithm or use the training function that comes with a model (the latter should only be done for models that require a very specialised training algorithm; I think I saw a model.train function somewhere). In the MLP case, you're probably using a training algorithm like SGD.
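Putting the pieces so far together, a minimal training setup with SGD might look like this (assuming the toy dataset and model from the snippets above; the hyperparameters are arbitrary):

```python
from pylearn2.train import Train
from pylearn2.training_algorithms.sgd import SGD
from pylearn2.termination_criteria import EpochCounter

algorithm = SGD(learning_rate=0.01,
                batch_size=100,
                termination_criterion=EpochCounter(max_epochs=5))

# Train wires the dataset, model and training algorithm together.
trainer = Train(dataset=dataset, model=model, algorithm=algorithm)
trainer.main_loop()
```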
5) This training algorithm lets you specify a Cost object (whose methods get handed an instance of your model), and will fall back on the model's default Cost object if you don't provide one (there's a get_default_cost method, or something like it, associated with the model; in the case of an MLP it seems to return an instance of 'Default' from costs/mlp/__init__.py, which in turn builds its cost from the theano expression model.cost_from_X).
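The 'Default' cost is tiny. From memory it looks roughly like this (a paraphrase, not a verbatim copy of the pylearn2 source):

```python
from pylearn2.costs.cost import Cost, DefaultDataSpecsMixin

class Default(DefaultDataSpecsMixin, Cost):
    """The default MLP cost: just defer to model.cost_from_X."""
    supervised = True

    def expr(self, model, data, **kwargs):
        # 'data' is the (X, Y) tuple laid out according to get_data_specs(model).
        space, sources = self.get_data_specs(model)
        space.validate(data)
        return model.cost_from_X(data)
```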
6) Now look back at model.cost_from_X, which Default in costs/mlp/__init__.py uses to build the symbolic cost expression. The signature of model.cost_from_X expects to be provided a variable 'data'. Looking further down in the function, we see that 'data' is actually a tuple of X and Y; the X is passed into self.fprop, and the output of the fprop is passed, along with Y, to self.cost (which in the case of the MLP calls the cost function associated with the final layer; for a sigmoidal output layer, this cost is the mean KL divergence).
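In outline, the body of that method does something like this (again a paraphrase of the MLP code, not the exact source):

```python
def cost_from_X(self, data):
    # 'data' is the (X, Y) tuple described by the cost's data_specs.
    X, Y = data
    Y_hat = self.fprop(X)       # forward pass through all the layers
    return self.cost(Y, Y_hat)  # delegates to the final layer's cost
```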
7) When the Cost object is asked for its data specs, it likely provides them by inheriting from DefaultDataSpecsMixin. Check out what DefaultDataSpecsMixin does! If self.supervised is true, it returns, for the space: CompositeSpace([model.get_input_space(), model.get_target_space()]). And for the source: (model.get_input_source(), model.get_target_source()). This lines up with the X,Y tuple that the variable 'data' was supposed to consist of in model.cost_from_X.
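Roughly, the mixin's method reads like this (a paraphrase; the unsupervised branch just returns the input space and source on their own):

```python
from pylearn2.space import CompositeSpace

def get_data_specs(self, model):
    if self.supervised:
        space = CompositeSpace([model.get_input_space(),
                                model.get_target_space()])
        sources = (model.get_input_source(), model.get_target_source())
        return (space, sources)
    else:
        return (model.get_input_space(), model.get_input_source())
```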
8) Going back to your SGD training algorithm: it will ask its Cost object for the data_specs it needs. The Cost object will return the tuple from DefaultDataSpecsMixin, which as mentioned above combines the (input_space, target_space) specified by the model. Come iteration time, these data_specs are passed to the dataset's iterator method, which in the case of the DenseDesignMatrix returns an instance of FiniteDatasetIterator from utils/iteration.py.
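Roughly what that request looks like, assuming the dataset and model from the earlier snippets (the iterator arguments are my best guess at the usual ones; check the SGD source for the exact call):

```python
from pylearn2.costs.mlp import Default

cost = Default()
# The Cost says what it needs...
data_specs = cost.get_data_specs(model)

# ...and the Dataset hands back an iterator that yields batches in that format.
it = dataset.iterator(mode='sequential',
                      batch_size=100,
                      data_specs=data_specs,
                      return_tuple=True)

for X_batch, y_batch in it:
    pass  # each batch arrives already formatted in the spaces the Cost asked for
```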
9) Finally, we close the loop (no pun intended). FiniteDatasetIterator has access both to the data_specs of the Dataset object and the data_specs that were demanded by the Cost object. To line the two up, it literally finds the indices of the string matches between the sources specified by the Dataset object and the sources specified by the Cost object (no kidding: "idx = dataset_source.index(so)", where so is the source from the Cost object). If you don't provide FiniteDatasetIterator with a "convert" array (which specifies a series of functions to convert between the format of the numpy arrays in the Dataset and the numpy arrays needed by the Cost), the FiniteDatasetIterator will just rely on dspace.np_format_as(batch, sp), where dspace is a Space from the Dataset object and sp is a Space from the Cost object.
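A stripped-down illustration of that matching step (plain Python mimicking the logic, not the actual pylearn2 code):

```python
from pylearn2.space import VectorSpace, Conv2DSpace

# What the Dataset advertises...
dataset_spaces = [VectorSpace(dim=784), VectorSpace(dim=10)]
dataset_sources = ('features', 'targets')

# ...and what the Cost demanded (note the different feature space).
cost_spaces = [Conv2DSpace(shape=[28, 28], num_channels=1), VectorSpace(dim=10)]
cost_sources = ('features', 'targets')

for sp, so in zip(cost_spaces, cost_sources):
    idx = dataset_sources.index(so)   # literal string matching on the source names
    dspace = dataset_spaces[idx]
    # each raw batch from the Dataset then gets converted via:
    # batch = dspace.np_format_as(raw_batch, sp)
```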
10) I still need to hammer out exactly how the nesting/flattening of the data source tuples is taken care of (this is important because a CompositeSpace can be built of other CompositeSpace objects), but it looks like that is handled in DataSpecsMapping from utils/data_specs.py. Anyway, this should at least be useful in remembering how the different parts of pylearn2 link up together.
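From what I can tell so far, the flattening gets used something like this (my reading of how DataSpecsMapping is called, e.g. in the SGD setup code; the 'extra' source and its space are made up for illustration, and the exact method signatures should be treated as approximate):

```python
from pylearn2.space import CompositeSpace, VectorSpace
from pylearn2.utils.data_specs import DataSpecsMapping

# A nested spec: a CompositeSpace containing another CompositeSpace.
nested_space = CompositeSpace([
    CompositeSpace([VectorSpace(dim=784), VectorSpace(dim=10)]),
    VectorSpace(dim=3),
])
nested_sources = (('features', 'targets'), 'extra')
nested_specs = (nested_space, nested_sources)

mapping = DataSpecsMapping(nested_specs)
flat_sources = mapping.flatten(nested_sources, return_tuple=True)
print(flat_sources)   # something like ('features', 'targets', 'extra')
```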