

#Name mangler sequence md5 free#
Similarity searches are potentially the most widely used type of sequence analysis. In some research projects, notably in metagenomics, the computational cost of similarity searching rapidly outstrips the cost of sequencing. Widely used metagenome analysis systems like IMG/M and MG-RAST employ substantial computational resources while computing sequence similarity results. Replacing the algorithms used to perform similarity searches (most commonly BLAST) with more efficient algorithms like BLAT will provide a much needed reduction in analysis cost. However, with comparative reference databases growing rapidly, a valuable addition to metagenomic analysis would be the ability to compute similarity searches only once and then exchange the results. Adoption of the common reference this requires is currently complicated by the fact that NCBI's non-redundant protein database (“nr”) captures only a single annotation.

All (in INSDC parlance) third-party annotations remain excluded. Researchers interested in enzyme numbers (ECs) or SEED subsystem identifiers are forced to repeat similarity searches against the same body of proteins. Genome (e.g. KEGG, SEED, IMG) and protein family (e.g. KEGG Orthologs, SEED FIGfams, COGs or eggNOG) (re-)annotation efforts are not captured by NCBI's nr. Many groups have created sequence identifier-based mechanisms for cross-linking database annotations. However, these are provided via web-based services and do not lend themselves to efficient local queries at the rates of several hundred thousand identifiers per second that are used by systems like MG-RAST to “translate” large numbers of sequences from one “namespace” into another. One example of this cross-linking is the ability to map the abundance of metagenomic reads onto COG categories, and then compare these to the same reads mapped onto SEED subsystems.

With multiple groups offering re-annotation of complete genomes, protein families, or analyses of metagenomes, an efficient way to reduce the overall resource consumption is to use a collapsed sequence database. This approach will allow multiple interpretations of a single similarity search rather than having to run separate searches against various individual databases. We anticipate more adopters of this resource and approach, like IMG/M. We have developed a non-redundant protein database (MD5nr) based on the use of MD5 checksums. Our approach separates sequence data from metadata, because the sequence and annotation data made available by different groups can logically be split into these two parts. The metadata contain sequence identifiers, potential species identifiers, and annotations. Annotation can exist in many forms, including a) free text (e.g. GenBank), b) mappings onto carefully curated functional namespaces (e.g. SEED), and c) mappings onto abstract protein families mixed with free text (e.g.
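
The idea of collapsing identical sequences under an MD5 checksum while keeping the metadata separate can be illustrated with a minimal Python sketch. It is not the MD5nr implementation: the function names (`sequence_md5`, `build_nonredundant`, `annotations_for`), the record layout, and the normalisation step (removing whitespace and upper-casing before hashing) are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict


def normalize(sequence):
    """Remove whitespace and upper-case the sequence (an assumption of this sketch)."""
    return "".join(sequence.split()).upper()


def sequence_md5(sequence):
    """Return the MD5 hex digest of a normalised protein sequence."""
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


def build_nonredundant(records):
    """Collapse records into a sequence table and a separate metadata table.

    `records` is an iterable of hypothetical tuples:
    (source, identifier, organism, annotation, sequence).
    Returns (sequences, metadata), both keyed by the MD5 checksum.
    """
    sequences = {}                # checksum -> sequence, stored once
    metadata = defaultdict(list)  # checksum -> one entry per source record
    for source, identifier, organism, annotation, sequence in records:
        key = sequence_md5(sequence)
        sequences.setdefault(key, normalize(sequence))
        metadata[key].append(
            {
                "source": source,
                "identifier": identifier,
                "organism": organism,
                "annotation": annotation,
            }
        )
    return sequences, dict(metadata)


def annotations_for(metadata, key, source):
    """Translate one checksum into the annotations of a chosen namespace."""
    return [m["annotation"] for m in metadata.get(key, []) if m["source"] == source]
```

With such a layout, a similarity search would be run once against the collapsed sequences, and each hit (identified by its checksum) could then be interpreted in any namespace through a local dictionary lookup, for example `annotations_for(metadata, key, "SEED")`, rather than through a repeated search or a web service call.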
