
Tuesday, August 18, 2015

YAML is really slow in vim

Apparently this is a known issue.

Here I'm documenting some specifics of the fix that weren't clear from the Stack Overflow answer. Not totally sure I did the "right" thing, but it works.

1) Download https://raw.githubusercontent.com/stephpy/vim-yaml/master/after/syntax/yaml.vim to ~/.vim
2) Put "au BufNewFile,BufRead *.yaml,*.yml so ~/.vim/yaml.vim" at the top of ~/.vimrc

You should be good to go.

Recursive find and replace

This was way too hard to figure out. I gave up on using sed because the -i option is interpreted differently on OS X; I was getting "invalid command code ." and "illegal byte sequence" errors, and ended up corrupting my git repository.

find . -type f -print0 | xargs -0 perl -i -pe '$_ =~ s/progressUpdates/progressUpdate/g' 
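
For reference, sed itself is fine if you account for the platform difference: GNU sed accepts a bare -i, while the BSD sed that ships with OS X requires an explicit (possibly empty) backup suffix after -i. A sketch of both variants - also excluding .git, which is probably how my repository got corrupted in the first place:

# GNU sed (linux)
find . -type f -not -path './.git/*' -print0 | xargs -0 sed -i 's/progressUpdates/progressUpdate/g'
# BSD sed (OS X) - note the empty suffix argument after -i
find . -type f -not -path './.git/*' -print0 | xargs -0 sed -i '' 's/progressUpdates/progressUpdate/g'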

Monday, July 13, 2015

Line by line python performance profiling (in particular, about where that @profile decorator comes from)

Here's the thing I didn't understand about rkern's line-by-line performance profiler: the kernprof executable injects the @profile decorator into your script's builtins, so no import is necessary. When you run your script without kernprof, you get a NameError because "profile" is not defined. But if you run kernprof -l [original python call], it works. Happy profiling!
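
For illustration, a minimal sketch - the file name and function here are made up. Save this as slow.py, and note that @profile is never defined or imported in the script itself:

@profile
def add_up(n):
    # something slow enough to be worth profiling
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    add_up(1000000)

Running "python slow.py" dies with the NameError; "kernprof -l -v slow.py" profiles it and prints the per-line timings.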

Tuesday, April 7, 2015

Adventures installing shogun and shogun's python modular interface

First of all, whatever instructions they have on their main website seem to be out of date, because there is no "configure" script, so ./configure gets you nowhere.

Second, I think there used to be a file called "INSTALL", but there isn't anymore. And the README is utterly unhelpful. This is all I could find, and following it seemed to work. That github repo also seems to be the most up to date in terms of documentation.

If ccmake is installed, use it; it will give you insight into all the options. If not, well, you need "cmake .. -DCmdLineStatic=ON -DPythonModular=ON" to get both the static command line interface and the python modular interface (I even tossed in a -DPythonStatic=ON, although I'm not sure it did anything; I was driving blind due to no ccmake, and for whatever reason none of this is easily accessible online!). Also, my colleague pointed me to this quickstart, also on the GitHub.
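
For the record, here is roughly the sequence that worked for me; treat it as a sketch (the URL is the shogun-toolbox GitHub repo, and your flags and install prefix may differ):

git clone https://github.com/shogun-toolbox/shogun.git
cd shogun
mkdir build
cd build
cmake .. -DCmdLineStatic=ON -DPythonModular=ON
make
sudo make install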

I had to install swig for the python modular interface. IMPORTANT: with swig 3.0.2, the python modular interface gave me a segfault, but downgrading to 2.0.12 worked. I've done this in the past without root access, but that time I had ccmake and could easily tell shogun where to find swig - this time, without ccmake, I couldn't figure out what to specify (seriously, this is awful documentation even by open source standards!). If you can't get root access, playing around with CMakeCache.txt in the build directory might be a start (note that both the build directory and the main directory have a CMakeCache.txt; you should modify the one in the build directory).

Finally, shogun installed completely error-free, but "import shogun" was not working. It was because shogun had installed to /usr/local/lib/python2.7/site-packages/, which wasn't even in my actual python's path, let alone anaconda's. To get it to work with anaconda, I copied the folder shogun and the files modshogun.py and _modshogun.so to /path/to/anaconda/lib/python2.7/site-packages.

To diagnose the issue with your particular installation of python, start the python shell, do "import sys", and then do "print sys.path" - if the place where shogun decided to put the python libraries (displayed on the screen when you do "make install", and also present in CMakeCache.txt in the build folder) is not in that list, well, that's the problem. By hook or by crook, get it on that list. One thing you can do is go back and recompile shogun with PYTHON_PACKAGES_PATH set to the right value - as I don't have ccmake, the only way I can think to set this is to open up CMakeCache.txt in your build folder and modify it there. I also hear something funky with the "PYTHONPATH" environment variable is supposed to work, but for whatever reason it never does for me. As a last resort, you can do sys.path.insert(0, "/path/to/shogun's/python/packages/folder/").
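
In other words, something like this in the (Python 2) shell - the site-packages path here is the one from my machine, so substitute your own:

import sys
print sys.path
# if the directory shogun installed into isn't in the list:
sys.path.insert(0, "/usr/local/lib/python2.7/site-packages/")
import shogun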

I was then confronted with a whiny "ImportError: libshogun.so.17: cannot open shared object file: No such file or directory". I looked at the messages when shogun was installed and confirmed that libshogun.so.17 did in fact live in /usr/local/lib. Turns out I had to do "sudo ldconfig" to get that to behave. NO ONE MENTIONED THAT ANYWHERE.

Monday, April 6, 2015

Adventures installing gkmersvm

gkmersvm (gkm-SVM) is a gapped k-mer SVM implementation from the Beer lab: http://www.beerlab.org/gkmsvm/

I ran into a couple of issues when trying to compile it, and I thought I'd describe them here.

When I first typed in make, I got: fatal error: 'tr1/unordered_map' file not found

This was fixed by upgrading my gcc using the instructions on: http://stackoverflow.com/questions/837992/update-gcc-on-osx

Then make resulted in: "'getopt' was not declared in this scope"

This was fixed by adding "#include <getopt.h>" to src/mainSVMclassify.cpp.

When I did this on our linux server, I also got a bunch of errors saying "'intptr_t' has not been declared". This was fixed by adding "#include <stdint.h>" to the following files (see the sketch after the list):

src/CountKLmersGeneral.cpp
src/MLEstimKLmers.cpp
src/LKTree.cpp
src/LTree.cpp
src/LKTree.h 
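
Concretely, the fixes are just one-line includes at the top of the relevant files; a sketch:

// src/mainSVMclassify.cpp - fixes "'getopt' was not declared in this scope"
#include <getopt.h>

// each of the files listed above - fixes "'intptr_t' has not been declared"
#include <stdint.h>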

After this, "make" produced the necessary binaries gkmsvm_classify, gkmsvm_kernel and gkmsvm_train. Curiously enough, there's no "make install", so I manually copied the binaries to /usr/local/bin.

Finally, when I tried to run the modshogun script, I got the error "Cannot create new instances of type 'Labels'". I think this is because I was using shogun 4.0.0, and the script was written for Shogun version 0.9.2 ~ 1.1.0. Changing every instance of "Labels" to "BinaryLabels" fixed the issue. Note, if you are getting a segfault with the modular interface, or are having trouble installing it, see my post about shogun.

Also, after that gcc upgrade, I ran into an error on a line containing "dispatch_block_t", which once again required modifying some source code: http://hamelot.co.uk/programming/osx-gcc-dispatch_block_t-has-not-been-declared-invalid-typedef/


Saturday, April 4, 2015

Copying files via the terminal gets around "read only" error from USB in Ubuntu

This was very weird. I was recently trying to get files off an old laptop that had suffered a fall. The internet connection was not working. Bluetooth was not working. And it looked like an old USB I found, which I erased and reformatted multiple times, wasn't working either. When I tried pasting files into the drive via the GUI, I got the following error (yes I know how to take screenshots, but remember that whether or not I could get files off this computer was an open question):

[screenshot of the error dialog]

And yet, copying into /media/[drive name] via the terminal worked, although it was slow:
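
(The command was just plain old cp, something along these lines - the source path is hypothetical:)

$ cp -rv ~/Documents/* "/media/[drive name]/"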


[shakes head]

Wednesday, February 25, 2015

Execute a command for every line in a file in bash

Another very handy bash snippet:

cat nodesToConsider.txt | while read in; do cp ../'layer0_inputGuess_n'$in'.png' .; done
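
Equivalently, you can skip the cat and redirect the file into the loop:

while read in; do cp ../'layer0_inputGuess_n'$in'.png' .; done < nodesToConsider.txt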

Saturday, February 21, 2015

Diagonal axis labels in R

I'm dumping a relevant code snippet here because I know I will want to come back to it again:
(found here)


> bp = barplot(data$informativeness[1:len], axes=FALSE, axisnames=FALSE)
> text(bp, par("usr")[3], labels = data$feature[1:len], srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.9)
> axis(2)
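
For a self-contained version you can paste straight into R (the data frame here is made up):

> data = data.frame(feature=c("alpha","beta","gamma","delta"), informativeness=c(0.9,0.7,0.4,0.2))
> len = nrow(data)
> bp = barplot(data$informativeness[1:len], axes=FALSE, axisnames=FALSE)
> text(bp, par("usr")[3], labels = data$feature[1:len], srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.9)
> axis(2)

srt=45 is the rotation angle, and xpd=TRUE is what lets the labels draw outside the plot region.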

Thursday, January 29, 2015

Inline for loops in bash...

...will make you happy. Here's a code snippet:

$ for i in a b c; do echo "file"$i".txt"; done
filea.txt
fileb.txt
filec.txt

Sunday, January 25, 2015

Excel is surprisingly powerful

First off, if you didn't know about Excel formulas and column filters, learn about them NOW. I'll update this incrementally as I use an Excel feature; I don't want to spend too much time writing this post.

Formulas
Exhibit A: =IF(AND((F2 > 0.09 + (0.68/1.3)*D2), (F2 > 0.025 + 0.8*D2)),1,0)

(TIL about the little summary bar at the bottom of your Excel window which tells you the sum of the selected cells. If you set it up with an IF formula like the one above, which outputs 1 if true and 0 if false, you can use this real-time-updating info bar to rapidly figure out how many of your selected rows satisfy a condition. Isn't that nifty?)

String concat
It can be done with an &; useful if, for instance, you need to turn "chr", "start", "end" into "chr:start-end" - but if you're doing too much of this you should probably learn inline perl. Still, it's handy if you already have Excel open for some reason.
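
For the chr/start/end example, assuming chr is in A2, start in B2 and end in C2 (cell references hypothetical):

=A2 & ":" & B2 & "-" & C2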

VLookup
I've usually avoided having to use this but it can be used to do the equivalent of a join - though the far more intuitive way would be to sort the two tables by a common key. And if you're spending too much time on this you should, again, probably do it programmatically. It's just useful to remember.
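
A minimal sketch, assuming your keys are in column A of both sheets and the value you want lives in column B of Sheet2:

=VLOOKUP(A2, Sheet2!$A:$B, 2, FALSE)

The FALSE forces an exact match, which is what you want when you're treating this as a join.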

Conditional formatting
This is really what I wanted to share on this post today. Don't waste time dumping your stuff into a separate heatmap application when a simple conditional formatting will suffice:

[screenshot: cells colored with a red-to-green conditional formatting color scale]

(sorry R/G colorblind folks...)

ooooo.

Wednesday, April 30, 2014

Converting a .rpt (or other fixed-width file) to a tsv (or: a crash course in inline perl)


Bottom-Line Up-Front
  1. Execute: head -2 [inputFile].rpt | tail -1 | perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)' | perl -pe '$_ .= "\n"'
  2. Get a string like A11A7A26A101A51A12A12A26A26A26A16
  3. Insert that string into the following command and execute: perl -pe '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputFile].rpt > [outputFile].tsv
Overview


You can do this with two perl one-liners.

A .rpt file is the standard export format from Microsoft SQL Server Management Studio. A fixed-width file is one where each column is padded with spaces to a fixed width; there can be spaces within columns or entirely blank columns.
RPT files look like:
columnNameOne columnName2 columnName3
------------- ----------- -----------
row1col1      row1col2    row1col3   
row2col1      row2col2    row2col3   

Occasionally you will see a funky byte-order mark at the beginning of the rpt file (http://stackoverflow.com/questions/3255993/how-do-i-remove-i-from-the-beginning-of-a-file). The second command below accounts for that byte-order mark.

First Command

The great thing is that we can use the second line of an rpt file to figure out the column widths. To get them, do:
head -2 [inputFile].rpt | tail -1 | perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)' | perl -pe '$_ .= "\n"'
This prints out a string encoding the width of each column, eg: "A11A7A26A101A51A12A12A26A26A26A16" (each number is actually the dash-run length plus one, which turns out to be exactly what unpack needs - see the anatomy below).

Anatomy of the command
  • head -2 [inputfile]
    • Takes the first two lines of the inputfile 
  • tail -1
    • Takes the last line of what's fed to it
  • The "|" operator chains commands - it means "take the output of the previous command and supply it as the input of the next command". The result of "head -2 [inputFile] | tail -1" is to get the second line of [inputFile]
  • perl -pe '$_ =~ s/ /\n/g'
    • The -pe flag says 'for every line in the input, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • "=~" means "execute the regex to the right on the variable to the left"
    • "s/ /\n/g" means "replace every space with a newline character". s/.../.../g replaces whatever matches the pattern between the first pair of slashes with what's between the second pair, and \n stands for a newline character. So s/ /\n/g replaces spaces with newlines.
    • So the above command transforms the single line "--- --- ---" into three lines, each containing "---"
  • perl -pe '$_ = "A".length($_)'
    • The -pe flag says 'for every line in the file, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • length($_) gives the length of the $_ string (inclusive of newline characters)
    • The period operator in perl represents string concatenation. So "A".length($_) means "append the length of $_ to the string A"
    • Note that the final value of $_ does not contain a newline character. So what this command accomplishes is to read every line fed into it, append that line's length to the letter A, and print the results as one contiguous string.
    • So if your input is three lines, each containing "---", your output after this command would be "A4A4A4" (remember, each length includes the line's trailing newline).
    • Combined with the previous command, if your input is "--- ---- --", then after "perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)'" you will get "A4A5A3". That off-by-one is exactly what you want: each unpack field comes out one wider than its dash run, so it also swallows the single space separating the columns. See the worked check after this list.
  • perl -pe '$_ .= "\n"'
    • This just says "take every line in the input and append a newline character". I just do this to make the output of the whole command look nice, because the previous command did not append any terminal newline character.
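
A worked check of the whole first command on the toy line from above:

$ echo "--- ---- --" | perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)' | perl -pe '$_ .= "\n"'
A4A5A3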

Second Command


Now that you have your string indicating the widths of all the columns from the first command, execute the following command:
perl -pe '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputFile].rpt > [outputFile].tsv

Anatomy of the command
  • perl -pe '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputfile].rpt > [outputFile].tsv
    • The -pe flag says 'for every line in the input, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • In this case, the 'input' is just the contents of [inputfile]
  • '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}'
    • $_ = ($. == 2) ? "" : ...
      • The "[condition] ? [1] : [2]" is a ternary operator. It means "if condition is true, return [1] - otherwise, return [2]"
      • The $. variable stores the line of the input file you are on
      • So $_ = ($. == 2) ? "" : [2] means "if you are on the second line of the file, store a blank string in $_; otherwise, store whatever [2] evaluates to in $_".
      • The command above is a way to strip off the second line of the file, which if you recall in an rpt file is just a bunch of dashes.
    • do {...}
      • This just means "execute whatever is in the curly braces. Then, return the value of the last thing you evaluated".
    • $_ =~ s/\xef\xbb\xbf//g
      • \xef\xbb\xbf refers to the icky byte-order-mark at the beginning of the file that we need to drop. http://stackoverflow.com/questions/3255993/how-do-i-remove-i-from-the-beginning-of-a-file
      • "=~" means "execute the regex to the right on the variable to the left"
      • s/.../.../g replaces everything in the first pair of slashes with everything in the second pair of slashes. s/\xef\xbb\xbf//g says "replace the byte-order mark with a blank".
      • So the above command strips away byte-order marks from $_.
    • The semicolon operator separates the statements within a perl command.
    • $" = "\t"
      • $" stores the 'interpolation' string: if you interpolate an array into a double-quoted string, the elements get separated with the contents of $". I am setting this to "\t" (the tab character) because I want a tsv.
    • @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_)
      • The unpack command is at the heart of my fixed-width to tsv conversion. It says "take the contents of $_, split it into fields according to the template "A11A7A26A101A51A12A12A26A26A26A16", and store the result in the @arr variable". More documentation on the unpack command is here: http://perldoc.perl.org/functions/unpack.html
    • "@arr\n"
      • This just says "take the elements of the @arr array and convert them to a string by separating them by whatever delimiter is stored in the $" variable (which in a preceding command we set to a tab character), and also append a newline character at the end"
    • The full consequence of this is to go over each line in the input fixed-width file, skip it if it's the second line, and otherwise convert it to a tab-separated line according to the template "A11A7A26A101A51A12A12A26A26A26A16".
  • > [outputfile].tsv
    • The ">" operator means "take the output of the previous command and write it to the file specified on the right". In this case, the file is outputFile.tsv
Note that an rpt file probably also has lines at the end that say "blah blah blah rows affected" - you probably want to ditch these rows:
  • To get the number of lines in a file, use "wc -l [name of file]" (that's a dash and a lowercase L)
  • To take only the first n rows of a file and write them to an output file, use: "head -n [inputfile] > [outputfile]"
  • So, if you want to ditch the last three lines of a file, first execute "wc -l [name of input file]", note the number returned (eg: 100), and then execute "head -97 [name of input file] > [name of output file]". (A one-step alternative follows this list.)
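
Alternatively, if you have GNU coreutils, "head -n -3 [inputfile] > [outputfile]" drops the last three lines in one step; that negative-count form is a GNU extension and won't work with the BSD head on OS X.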
References
  • http://stackoverflow.com/questions/6302025/perl-flags-pe-pi-p-w-d-i-t
  • http://www.perl.com/pub/2004/06/18/variables.html
  • http://perldoc.perl.org/functions/unpack.html
  • http://perldoc.perl.org/functions/do.html
More cool perl
This command drops the columns at zero-based indices 2-4 and 6 (i.e. the 3rd-5th and 7th columns) from the input tsv:

perl -pe '$_ = do {%exclude = map {$_,1} (2..4,6); @arr = split(/\t/,$_); @arr = map {defined($exclude{$_}) ? () : $arr[$_]} (0..$#arr); $" = "\t"; "@arr"}' input.tsv > output.tsv

The most confusing aspects of the command above that haven't been covered elsewhere in this page are probably:

  1. The creation of the %exclude hash. Hashes are made using the syntax %nameOfHash = (even,sized,array,containing,alternating,key,value,pairs).
  2. The 2..4 syntax: this produces an array with the elements (2,3,4).
  3. The fact that parentheses are used to declare an array; (2,3,4) actually creates an array with those elements.
  4. The map {..} (anarray) syntax. This takes every element of (anarray), stores it to $_ (note: variable scoping is key! This doesn't overwrite the value of $_ in outer blocks!), applies the operation in {...}, and appends the result to a resulting array. In other words, map {$_,1} (2,3,4,6) produces the following array: (2,1,3,1,4,1,6,1). When this is used to create the %exclude hash, it makes a hash where the keys are 2,3,4 and 6 and the values are all 1.
  5. That $arr[$_] returns what is at index $_ of the @arr array (I know, the syntax is awful).
  6. That $#arr returns "size of @arr minus 1" 

Saturday, October 22, 2011

Really handy R knowledge

R. It's big, it's great, it's horribly under-documented.

If you don't want to pay for a decidedly Ctrl+F lacking book, and you've tried to self-teach R, chances are you've been very very frustrated at least once.

Here, I'm literally just going to dump handy R-knowledge that was not as easy to find as it should have been.

Jan 25th 2015 note: As I have recently returned to using R after a break, I'm revisiting and updating this list. I'm much less peculiar today than I was when I first wrote this post, so I hope you'll forgive the comparatively subdued tone. I think I picked a very strange list of 'handy' functions.

1. Double brackets vs. single brackets

The first headache-cure you learn. Single brackets return an object of the datatype of the thing being indexed into; double brackets return the datatype of the actual thing - the one you want!

example:
> moo = list("v1" = c(1,2,3),"v2" = c(2,3,4))
> moo
$v1
[1] 1 2 3
$v2
[1] 2 3 4
> moo[1]
$v1
[1] 1 2 3
# ^ That was a list! asdfawefsadf
> moo[[1]]
[1] 1 2 3
# ^ That was a vector! yay!

2. Unlist

Sheesh. For solving the same headache as above.
> unlist(moo)
v11 v12 v13 v21 v22 v23
1 2 3 2 3 4
And THAT, children, is how to convert a list into a vector.

3. Append
Can also be used to shove items into a vector at places other than the end.
> hi = c(1,2,3)
> hi = append(hi,4)
> hi = append(hi,-1,0)
> hi
[1] -1 1 2 3 4
> hi = append(hi,0,1)
> hi
[1] -1 0 1 2 3 4

4. Cbind
Lets you turn a row vector into a column vector. Handy if you're trying to build up a data frame. Oh god data frames....ok, one thing at a time.
Using hi from the previous example:
> cbind(hi)
hi
[1,] -1
[2,] 0
[3,] 1
[4,] 2
[5,] 3
[6,] 4

Jan 25th 2015: yes, let's talk about those data frames.
4.1. Reading a table into a data frame
> x <- read.table("/path/to/file/fun/fact/R/tab/completes.txt", header=TRUE, as.is=TRUE)
The as.is=TRUE (equivalently, stringsAsFactors=FALSE) tells R to treat strings as actual strings, rather than as these peculiar things called "factors". I don't know who thought it would be a good idea to default to reading strings as factors. Maybe they enjoy being mean to novices. It sounds like factors could in theory be useful for filtering, but in bioinformatics your strings are usually things like gene names or chromosomal coordinates, and very rarely a filtering keyword...

5. Subset
Really handy tool. Right now, I have on my workspace a data frame called "correlations". It has 124 rows, and, uh, I want to say 264 columns but I believe R thinks of them differently.
> length(names(correlations))
[1] 264
> length(row.names(correlations))
[1] 124
To apply a selection condition, the first argument is the data frame and the second argument is a condition involving the data frame. The syntax is:
> hi = subset(correlations,correlations[1] > 0.5)
> length(row.names(hi))
[1] 28
here, I selected for all rows where the value in the first column was > 0.5. Another way to do that would have been to use the name of the first column:
> names(correlations[1])
[1] "AHR"
> hi2 = subset(correlations,AHR > 0.5)
> length(row.names(hi2))
[1] 28
Incidentally, while in the example I used the column correlations[1] and placed a logical condition on it, I could have technically used any column vector containing 124 boolean values as my "condition", and the syntax would have worked:
> coolvector = correlations[1] > 0.5
> coolvector[0]#this gives me information about the type of object
logical(0)
> length(coolvector)
[1] 124
> hi = subset(correlations,coolvector)
> length(row.names(hi))
[1] 28
Also, if you have row names, the syntax for using row names to extract information about a particular row from a data frame looks like this (here, I have a row named "AP2_Q6" in the data frame hi2, and I'm accessing the first column in that row):
> hi2["AP2_Q6",1]
[1] 0.7946332

6. eval

Handy, handy, handy for making your code flexible. Lets you interpret a string as a command. Let me demonstrate:
> cutecondition = "happycolumn <- correlations$AHR > 0.5"
> eval(parse(text=cutecondition))
> hi = subset(correlations,happycolumn)
> length(row.names(hi))
[1] 28

Yay! I feel like I've done a service to humanity! Now I need to get back to work.

Peace love and bunnies!
- Av