[i=s] 本帖最后由 huangqp 于 2009-5-24 03:18 编辑 [/i]
哪位知道如何利用基因号提取全序列
一个文件file1含有基因号,如下,
AT5G54820.1
AT5G32220.1
AT5G20470.1
下面是一个database文件file2,想从中通过上面文件所包含的基因号从下面的文件中提取相应的序列。
>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
>AT5G20470.1 | Symbols: | ffdffs protein | chr1:19049283-19050416 FORWARD
AINHLCICQANQASSVAWGVHAFSRKTQPLDN
>AT5G42150.1 | Symbols: | dwefs protein | chr1:19049283-19050416 FORWARD
CCIADYEKGDKITCKFHDCIAENKCPMFHSSVR
>AT4G02570.1 | Symbols: | fghhfs protein | chr1:19049283-19050416 FORWARD
CSICLQSLVSSSKTRMSHHNGLVELNRCPMFH
>AT5G32220.1 | Symbols: | shrb protein | chr1:19049283-19050416 FORWARD
CSICMENLNSESSSENIISCLHLFHQSCIFESESS
以第一条序列为例,解释一下这里所谓的一条序列,即包括:注释+序列
注释:
>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
序列:
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
简言之,现在想通过file1所包含的AT5G54820.1,从file2中提出其对应的全序列,即:
>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
当然所有file1中的基因号在file2中都有!
希望通过常用命令完成,而不是用写程序的方式!请高手指点,多谢!!!!!!!!!
刘冲 于 2009-05-24 09:59:46发表:
为啥用英语?锻炼写作吗?
Itatmn 于 2009-05-24 09:49:01发表:
高深
linuxcn 于 2009-05-24 06:34:20发表:
Hi
) {
;
;
) {
The better way to do your work is to use perl (in my mind). here is a simple script in file file12.pl
at a terminal call this command
perl file12.pl file1 file2
the output looks like this
5 lines
AT5G54820.1 QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
AT5G32220.1 CSICMENLNSESSSENIISCLHLFHQSCIFESESS
AT5G20470.1 AINHLCICQANQASSVAWGVHAFSRKTQPLDN
you can put the output to a new file "output", as
perl file12.pl file1 file2 > output
I strongly encourage you to try perl for this type of works.
good luck
Yuan
#!/bin/perl
$f1=$ARGV[0];
$f2=$ARGV[1];
open (F2, "<$f2");
$n=0;
while (
s/^\s+//;
m/^>/ && $n++;
}
print "$n lines\n";
seek (F2, 0,0);
my %seq;
foreach (1 .. $n) {
$l1=
$l2=
chomp $l1;
chomp $l2;
@a = split (/\|/, $l1);
$a[0] =~ s/^>//;
$a[0] =~ s/\s+//;
$seq{$a[0]} = $l2;
#print "$a[0],$l2,$a[1]\n";
}
close (F2);
open (F1, "<$f1");
while (
s/^\s+//;
chomp;
print "$_ $seq{$_}\n";
}
close (F1);
close (F2);
exit 0;