DOI: 10.1093/bioinformatics/btr447
SEED: efficient clustering of next-generation sequences
Authors: Bao, Ergude [2]; Jiang, Tao [2]; Kaloshian, Isgouhi [3]; Girke, Thomas [1]
Corresponding author: Girke, Thomas
Source journal: BIOINFORMATICS
ISSN: 1367-4803
EISSN: 1460-2059
Publication year: 2011
Volume: 27; Issue: 18; Pages: 2502-2509
Abstract

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.


Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in < 4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
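The general idea of hashing fragments of reads to find near-identical neighbors can be illustrated with a minimal Python sketch. This is not the SEED implementation: it swaps in a simple pigeonhole block filter in place of block spaced seeds, ignores overhanging residues, handles only equal-length reads, and takes the first unassigned read as the center rather than a virtual center. The names `hamming`, `block_keys`, and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(read: str, n_blocks: int):
    """Yield (block index, block substring) pairs that cover the read."""
    size = len(read) // n_blocks
    for i in range(n_blocks):
        start = i * size
        # The last block absorbs any remainder.
        end = len(read) if i == n_blocks - 1 else start + size
        yield i, read[start:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering (assumption, not SEED's method): each unassigned
    read becomes a center and collects every unassigned read of the same
    length within max_mismatches of it."""
    # Pigeonhole: two equal-length reads with at most k mismatches must
    # agree exactly on at least one of k + 1 non-overlapping blocks.
    n_blocks = max_mismatches + 1
    index = defaultdict(list)  # (block index, block) -> read ids
    for rid, read in enumerate(reads):
        for key in block_keys(read, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for rid, read in enumerate(reads):
        if assigned[rid]:
            continue
        assigned[rid] = True
        members = [rid]
        # Candidate neighbors share at least one exact block with the center.
        candidates = set()
        for key in block_keys(read, n_blocks):
            candidates.update(index[key])
        for cid in sorted(candidates):
            if assigned[cid] or len(reads[cid]) != len(read):
                continue
            if hamming(read, reads[cid]) <= max_mismatches:
                assigned[cid] = True
                members.append(cid)
        clusters.append(members)
    return clusters
```

Calling `cluster_reads(reads, max_mismatches=3)` on a list of read strings returns lists of read indices, one list per cluster, with the center first. Only reads that share an exact block with a center are compared in full, which is what keeps the number of pairwise comparisons small in this sketch.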


Type: Article
Language: English
Country: USA
Indexed in: SCI-E
WOS accession number: WOS:000294755400005
WOS keywords: GENOME ; PROGRAM ; PROTEIN ; SEARCH ; FORMAT ; FASTER ; RNAS ; TOOL
WOS categories: Biochemical Research Methods ; Biotechnology & Applied Microbiology ; Computer Science, Interdisciplinary Applications ; Mathematical & Computational Biology ; Statistics & Probability
WOS research areas: Biochemistry & Molecular Biology ; Biotechnology & Applied Microbiology ; Computer Science ; Mathematical & Computational Biology ; Mathematics
Resource type: Journal article
Item identifier: http://119.78.100.177/qdio/handle/2XILL650/167363
Author affiliations: 1. Univ Calif Riverside, Dept Bot & Plant Sci, Riverside, CA 92521 USA;
2. Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA;
3. Univ Calif Riverside, Dept Nematol, Riverside, CA 92521 USA
Recommended citation
GB/T 7714: Bao, Ergude, Jiang, Tao, Kaloshian, Isgouhi, et al. SEED: efficient clustering of next-generation sequences[J], 2011, 27(18): 2502-2509.
APA: Bao, Ergude, Jiang, Tao, Kaloshian, Isgouhi, & Girke, Thomas. (2011). SEED: efficient clustering of next-generation sequences. BIOINFORMATICS, 27(18), 2502-2509.
MLA: Bao, Ergude, et al. "SEED: efficient clustering of next-generation sequences". BIOINFORMATICS 27.18 (2011): 2502-2509.