红联Linux门户
Linux帮助

求助:哪位高手知道如何通过基因编号提取序列

发布时间:2009-05-24 03:12:53来源:红联作者:huangqp
[i=s] 本帖最后由 huangqp 于 2009-5-24 03:18 编辑 [/i]

哪位知道如何利用基因号提取全序列
一个文件file1含有基因号,如下,
AT5G54820.1
AT5G32220.1
AT5G20470.1

下面是一个database文件file2,想从中通过上面文件所包含的基因号从下面的文件中提取相应的序列。

>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
>AT5G20470.1 | Symbols: | ffdffs protein | chr1:19049283-19050416 FORWARD
AINHLCICQANQASSVAWGVHAFSRKTQPLDN
>AT5G42150.1 | Symbols: | dwefs protein | chr1:19049283-19050416 FORWARD
CCIADYEKGDKITCKFHDCIAENKCPMFHSSVR
>AT4G02570.1 | Symbols: | fghhfs protein | chr1:19049283-19050416 FORWARD
CSICLQSLVSSSKTRMSHHNGLVELNRCPMFH
>AT5G32220.1 | Symbols: | shrb protein | chr1:19049283-19050416 FORWARD
CSICMENLNSESSSENIISCLHLFHQSCIFESESS

以第一条序列为例,解释一下这里所谓的一条序列,即包括:注释+序列
注释:
>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
序列:
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI

简言之,现在想通过file1所包含的AT5G54820.1,从file2中提出其对应的全序列,即:

>AT5G54820.1 | Symbols: | scws protein | chr1:19049283-19050416 FORWARD
QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI

当然所有file1中的基因号在file2中都有!

希望通过常用命令完成,而不是用写程序的方式!请高手指点,多谢!!!!!!!!!
文章评论

共有 3 条评论

  1. 刘冲 于 2009-05-24 09:59:46发表:

    为啥用英语?锻炼写作吗?

  2. Itatmn 于 2009-05-24 09:49:01发表:

    高深

  3. linuxcn 于 2009-05-24 06:34:20发表:

    Hi
    The better way to do your work is to use perl (in my mind). here is a simple script in file file12.pl

    at a terminal call this command

    perl file12.pl file1 file2

    the output looks like this

    5 lines
    AT5G54820.1 QRDLAKDRPNASGLQEVLSHFKCLDIDNDPSCI
    AT5G32220.1 CSICMENLNSESSSENIISCLHLFHQSCIFESESS
    AT5G20470.1 AINHLCICQANQASSVAWGVHAFSRKTQPLDN


    you can put the output to a new file "output", as

    perl file12.pl file1 file2 > output

    I strongly encourage you to try perl for this type of works.

    good luck

    Yuan






    #!/bin/perl
    $f1=$ARGV[0];
    $f2=$ARGV[1];
    open (F2, "<$f2");
    $n=0;
    while () {
    s/^\s+//;
    m/^>/ && $n++;
    }
    print "$n lines\n";
    seek (F2, 0,0);
    my %seq;
    foreach (1 .. $n) {
    $l1=;
    $l2=;
    chomp $l1;
    chomp $l2;
    @a = split (/\|/, $l1);
    $a[0] =~ s/^>//;
    $a[0] =~ s/\s+//;
    $seq{$a[0]} = $l2;
    #print "$a[0],$l2,$a[1]\n";
    }
    close (F2);
    open (F1, "<$f1");
    while () {
    s/^\s+//;
    chomp;
    print "$_ $seq{$_}\n";
    }
    close (F1);
    close (F2);
    exit 0;