Knowledge Resource Center for Ecological Environment in Arid Area
DOI | 10.1093/bioinformatics/btr447 |
Title | SEED: efficient clustering of next-generation sequences |
Authors (affiliation number in brackets) | Bao, Ergude [2]; Jiang, Tao [2]; Kaloshian, Isgouhi [3]; Girke, Thomas [1] |
Corresponding author | Girke, Thomas |
Journal | BIOINFORMATICS |
ISSN | 1367-4803 |
EISSN | 1460-2059 |
Publication year | 2011 |
Volume | 27 |
Issue | 18 |
Pages | 2502-2509 |
Abstract | Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in < 4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. |
Type | Article |
Language | English |
Country | USA |
Indexed in | SCI-E |
WOS accession number | WOS:000294755400005 |
WOS keywords | GENOME ; PROGRAM ; PROTEIN ; SEARCH ; FORMAT ; FASTER ; RNAS ; TOOL |
WOS categories | Biochemical Research Methods ; Biotechnology & Applied Microbiology ; Computer Science, Interdisciplinary Applications ; Mathematical & Computational Biology ; Statistics & Probability |
WOS research areas | Biochemistry & Molecular Biology ; Biotechnology & Applied Microbiology ; Computer Science ; Mathematical & Computational Biology ; Mathematics |
Resource type | Journal article |
Item identifier | http://119.78.100.177/qdio/handle/2XILL650/167363 |
Author affiliations | 1.Univ Calif Riverside, Dept Bot & Plant Sci, Riverside, CA 92521 USA; 2.Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA; 3.Univ Calif Riverside, Dept Nematol, Riverside, CA 92521 USA |
Recommended citation (GB/T 7714) | Bao, Ergude,Jiang, Tao,Kaloshian, Isgouhi,et al. SEED: efficient clustering of next-generation sequences[J],2011,27(18):2502-2509. |
APA | Bao, Ergude,Jiang, Tao,Kaloshian, Isgouhi,&Girke, Thomas.(2011).SEED: efficient clustering of next-generation sequences.BIOINFORMATICS,27(18),2502-2509. |
MLA | Bao, Ergude,et al."SEED: efficient clustering of next-generation sequences".BIOINFORMATICS 27.18(2011):2502-2509. |
Files in this item | There are no files associated with this item. |
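
The abstract states that SEED joins a sequence to a cluster when it differs from the virtual center by up to three mismatches and three overhanging residues. The following is a minimal sketch of one way such a criterion could be checked for a single read, not the authors' implementation. The function name `within_seed_threshold`, its parameters, and the reading of "overhang" as the number of read residues falling outside the center after shifting are all assumptions made for illustration; the hash-table lookup with block spaced seeds described in the abstract is not modeled here.

```python
def within_seed_threshold(center, read, max_mismatches=3, max_overhang=3):
    """Hypothetical check: does `read` match `center` under some small shift?

    Assumption: "overhang" is counted as the read residues that fall
    outside the center once the read is shifted against it.
    """
    for shift in range(-max_overhang, max_overhang + 1):
        # Overlap expressed in center coordinates for this shift.
        start = max(0, shift)
        end = min(len(center), shift + len(read))
        if end <= start:
            continue  # no overlap at this shift
        overhang = len(read) - (end - start)
        if overhang > max_overhang:
            continue
        # Count mismatches over the overlapping positions only.
        mismatches = sum(
            1 for i in range(start, end) if center[i] != read[i - shift]
        )
        if mismatches <= max_mismatches:
            return True
    return False


if __name__ == "__main__":
    center = "ACGTTGCAAGCT"
    shifted = center[2:] + "GG"  # shifted by two, two overhanging residues
    print(within_seed_threshold(center, center))
    print(within_seed_threshold(center, shifted))
```

This brute-force comparison costs time proportional to the read length for every candidate pair, which is why a seed-based hash lookup, as the abstract describes, would be needed to first narrow down candidate neighbors at the scale of millions of reads.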