Scalable Bioinformatics Services

I have previously worked on the MEME suite of web services [Bailey and Machanick 2012, Machanick and Bailey 2011, Bailey et al. 2010], which encompasses a wide range of bioinformatics tools with an emphasis on DNA sequence analysis (though some tools also support proteins). The MEME tools contrast with the Galaxy web service [Goecks et al. 2010]. MEME is a more focused tool chain with a simple point-and-click interface. Though it can combine tools in a number of useful ways, it offers users no way to construct their own pipelines. Galaxy, on the other hand, is more open-ended: it allows construction of pipelines by combining existing tools and provides a framework for adding new tools.

There are two gaps in current offerings:

- Ease of use versus configurability: the MEME tools are extremely easy to use, while Galaxy provides greater generality at the cost of more complexity. The gap here is between MEME’s ease of use and Galaxy’s configurability.
- Cloud resources: neither web service has a simple model for leveraging cloud resources. Both either run on a shared server with capacity limits for free use (e.g. the free Galaxy server has a per-user disk quota) or require installation on the user’s own infrastructure. Galaxy can be scaled up using Amazon EC2, but this requires complex configuration. To use the MEME tools on cloud infrastructure, you would have to create an instance yourself.

[Figures: example of a MEME web service; example of a Galaxy workflow]

The aim of this project is to investigate:

- Combining the benefits of the MEME and Galaxy services: Galaxy’s configurability and MEME’s ease of use. This requires finding a new compromise in the design space, with easier-to-use scripting and pipeline construction. Usability claims should be tested against Galaxy with the help of working biologists. Approaches to investigate include:
  - methods for specifying pipelines based on function rather than tools
  - methods for documenting workflows that are not purely dumps of the scripts and programs run
  - methods for specifying tool wrappers by example rather than creating yet another script-like syntax (or yet another flowchart-like visual workflow notation [Byelas and Swertz 2006])
- Transparent transition from local server to cloud-based execution, requiring only payment credentials to use cloud services.
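
As a rough illustration of the first and last ideas above, the sketch below specifies a pipeline by function rather than by tool, and picks an execution backend based only on whether payment credentials are available. It is a minimal sketch and not a proposed design: the registry contents, the function names (`discover_motifs`, `compare_motifs`), the `Step` type and the `local`/`cloud` labels are all hypothetical, and nothing here invokes a real tool or cloud API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical registry: maps what the user wants done (a function)
# to the concrete tool that does it. Entries are illustrative only.
TOOL_REGISTRY: Dict[str, str] = {
    "discover_motifs": "meme",
    "compare_motifs": "tomtom",
}


@dataclass
class Step:
    function: str  # the task as the biologist would describe it
    tool: str      # the concrete tool selected from the registry


def resolve_pipeline(functions: List[str], registry: Dict[str, str]) -> List[Step]:
    """Turn a list of requested functions into concrete tool steps."""
    steps = []
    for name in functions:
        if name not in registry:
            raise KeyError(f"no tool registered for function '{name}'")
        steps.append(Step(function=name, tool=registry[name]))
    return steps


def choose_backend(has_payment_credentials: bool) -> str:
    """Assumed policy: use the cloud only when credentials were supplied."""
    return "cloud" if has_payment_credentials else "local"


def plan(functions: List[str], registry: Dict[str, str],
         has_payment_credentials: bool = False) -> List[str]:
    """Describe where each resolved step would run, without running it."""
    backend = choose_backend(has_payment_credentials)
    return [f"{backend}:{step.tool}" for step in resolve_pipeline(functions, registry)]


if __name__ == "__main__":
    wanted = ["discover_motifs", "compare_motifs"]
    print(plan(wanted, TOOL_REGISTRY))
    print(plan(wanted, TOOL_REGISTRY, has_payment_credentials=True))
```

The point of the separation is that the user only ever writes the list of functions; which tool implements each one, and where it runs, can change without the pipeline description changing.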

References

[Bailey and Machanick 2012] Timothy L. Bailey and Philip Machanick, Inferring direct DNA binding from ChIP-seq, Nucleic Acids Research, vol. 40, no. 17, September 2012, p. e128 (10 pages); first published online: 18 May 2012
[Bailey et al. 2010] Timothy L. Bailey, Mikael Bodén, Tom Whitington and Philip Machanick, The value of position-specific priors in motif discovery using MEME, BMC Bioinformatics, vol. 11 p179, 2010
[Byelas and Swertz 2006] Byelas, H. V., and M. A. Swertz. Visualization of bioinformatics workflows for ease of understanding and design activities, Proc. 6th Int. Joint Conf. on Biomedical Engineering Systems and Technologies, 2006
[Goecks et al. 2010] Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biol. 2010 Aug 25;11(8):R86
[Machanick and Bailey 2011] Philip Machanick and Timothy L. Bailey, MEME-ChIP: motif analysis of large DNA datasets, Bioinformatics, vol. 27, no. 12, pp. 1696-1697, 2011