Knowledge Resource Center for Ecological Environment in Arid Area (Arid): SEED: efficient clustering of next-generation sequences

DOI	10.1093/bioinformatics/btr447
Title	SEED: efficient clustering of next-generation sequences
Authors	Bao, Ergude 2; Jiang, Tao 2; Kaloshian, Isgouhi 3; Girke, Thomas 1
Corresponding author	Girke, Thomas
Source journal	BIOINFORMATICS
ISSN	1367-4803
EISSN	1460-2059
Publication year	2011
Volume	27 Issue: 18 Pages: 2502-2509
Abstract	Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in < 4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED’s utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
Type	Article
Language	English
Country	USA
Indexed in	SCI-E
WOS accession number	WOS:000294755400005
WOS keywords	GENOME ; PROGRAM ; PROTEIN ; SEARCH ; FORMAT ; FASTER ; RNAS ; TOOL
WOS categories	Biochemical Research Methods ; Biotechnology & Applied Microbiology ; Computer Science, Interdisciplinary Applications ; Mathematical & Computational Biology ; Statistics & Probability
WOS research areas	Biochemistry & Molecular Biology ; Biotechnology & Applied Microbiology ; Computer Science ; Mathematical & Computational Biology ; Mathematics
Resource type	Journal article
Item identifier	http://119.78.100.177/qdio/handle/2XILL650/167363
Author affiliations	1.Univ Calif Riverside, Dept Bot & Plant Sci, Riverside, CA 92521 USA; 2.Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA; 3.Univ Calif Riverside, Dept Nematol, Riverside, CA 92521 USA
Recommended citation format GB/T 7714	Bao, Ergude,Jiang, Tao,Kaloshian, Isgouhi,et al. SEED: efficient clustering of next-generation sequences[J],2011,27(18):2502-2509.
APA	Bao, Ergude,Jiang, Tao,Kaloshian, Isgouhi,&Girke, Thomas.(2011).SEED: efficient clustering of next-generation sequences.BIOINFORMATICS,27(18),2502-2509.
MLA	Bao, Ergude,et al."SEED: efficient clustering of next-generation sequences".BIOINFORMATICS 27.18(2011):2502-2509.
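
The abstract describes clustering reads around a center sequence by looking up candidates in hash tables and keeping those within three mismatches. The sketch below illustrates that general idea only and is not the SEED implementation. It makes several assumptions that are not in the record: it replaces the paper's block spaced seeds with a simpler pigeonhole block index (a read split into `max_mismatches + 1` blocks must share at least one exact block with any read within `max_mismatches` substitutions), it takes the first unassigned read as the center rather than SEED's virtual centers, it ignores overhanging residues, and it only compares reads of equal length. The function names `hamming`, `block_keys`, and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(read, n_blocks):
    """Yield (block_index, block_substring) keys for one read.

    The last block absorbs any remainder when the read length is not
    divisible by n_blocks.
    """
    size = len(read) // n_blocks
    for i in range(n_blocks):
        start = i * size
        end = len(read) if i == n_blocks - 1 else start + size
        yield i, read[start:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering sketch: returns a list of clusters of read indices.

    Assumption: the first unassigned read acts as the cluster center.
    """
    n_blocks = max_mismatches + 1

    # Hash table from (block position, block content) to read indices.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for key in block_keys(read, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for center_id, center in enumerate(reads):
        if assigned[center_id]:
            continue
        assigned[center_id] = True
        members = [center_id]

        # Candidates share at least one exact block with the center.
        candidates = set()
        for key in block_keys(center, n_blocks):
            candidates.update(index[key])

        # Verify each candidate with a full mismatch count.
        for rid in sorted(candidates):
            if assigned[rid] or len(reads[rid]) != len(center):
                continue
            if hamming(reads[rid], center) <= max_mismatches:
                assigned[rid] = True
                members.append(rid)
        clusters.append(members)
    return clusters
```

Calling `cluster_reads(list_of_reads)` returns groups of indices into the input list, with the center first in each group. The block index keeps candidate lookup cheap because only reads sharing an exact block are ever compared in full, which is the same motivation the abstract gives for operating on hash tables.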