Chromosome 22 comments: RepeatMasker output analysis



Analysis of the Retroviral category in RepeatMasker output files.

There are two ways of analysing RepeatMasker output. The first is to use the ".out" file produced by the program (see the example below). This is probably what the authors of the Chromosome 22 paper did. However, this file describes individual fragments, not complete sequences: each line of the output represents one hit found by the Smith-Waterman local similarity search.
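Since each line of the ".out" file is one hit, a line can be split into its whitespace-separated columns. The following is a minimal sketch, assuming the column order shown in the ".out" example on this page; the names `Hit` and `parse_out_line` are hypothetical and not part of RepeatMasker.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    sw_score: int
    perc_div: float
    query: str
    begin: int          # 1-based start in the query sequence
    end: int            # 1-based inclusive end in the query sequence
    strand: str         # "+" or "C" (complement)
    repeat: str         # matching repeat, e.g. "LTR15"
    repeat_class: str   # class/family column, e.g. "LTR/Retroviral"


def parse_out_line(line: str) -> Optional[Hit]:
    """Return a Hit for a data line, or None for header and blank lines."""
    fields = line.split()
    # Data lines start with an integer score and have at least 11 columns.
    if len(fields) < 11 or not fields[0].isdigit():
        return None
    return Hit(
        sw_score=int(fields[0]),
        perc_div=float(fields[1]),
        query=fields[4],
        begin=int(fields[5]),
        end=int(fields[6]),
        strand=fields[8],
        repeat=fields[9],
        repeat_class=fields[10],
    )


# One line copied from the ".out" example on this page.
line = (" 3575   7.3  1.2  1.2  NT_002446   15493  15986 (218240) C  "
        "LTR15             LTR/Retroviral           (6)  487      2  ")
hit = parse_out_line(line)
print(hit.repeat_class, hit.end - hit.begin + 1)
```

Note that a record obtained this way still describes a single fragment; it says nothing about which fragments belong to the same element.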

The second (and easier) way is to use the tables stored in the ".tbl" file (example), where biologically relevant categories are reported. This avoids mistakes such as lumping all LTR or even all MER entries (which include LTRs of HERVs, MER4-group sequences, DNA transposons, ...) into one group. The ProcessRepeats program (part of RepeatMasker) is able to join some fragments (as in example 1), but fails in more complicated situations (example 2, example 3).
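The category rows of the table file can be read with a short regular expression. This is only a sketch, assuming the row layout of the table example shown on this page (name, number of elements, length in bp, percentage); `ROW` and `parse_tbl` are hypothetical names, and other RepeatMasker versions may format the table differently.

```python
import re

# Matches rows such as "      Retrov.       58       44876 bp     2.51 %".
# Rows without an element count (e.g. "Total interspersed repeats") are skipped.
ROW = re.compile(
    r"^\s*(?P<name>\S.*?):?\s+(?P<count>\d+)\s+(?P<bp>\d+) bp\s+(?P<pct>[\d.]+) %\s*$"
)


def parse_tbl(text):
    """Map category name -> (number of elements, length in bp, percentage)."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group("name")] = (
                int(m.group("count")),
                int(m.group("bp")),
                float(m.group("pct")),
            )
    return rows


# Rows copied from the table example for NT_001039 on this page.
sample = """\
LTR elements:      172       88611 bp     4.95 %
      MaLRs         74       22822 bp     1.28 %
      Retrov.       58       44876 bp     2.51 %
      MER4_group    30       17601 bp     0.98 %
Total interspersed repeats: 655320 bp    36.61 %
"""
table = parse_tbl(sample)
print(table["Retrov."])
```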

The analysis presented here is based on the ".out" files, with corrections made to the HERV category. This corrected HERV category is similar but not identical to the Retroviral group in the RepeatMasker ".tbl" output. However, defining independent, unique elements remains uncertain, especially in the case of fused elements from different families.

When all 1112 elements listed as "LTR/Retroviral" in the RepeatMasker ".out" files are taken into account, the overall coverage of retroviral sequences reaches 1.81 %:

contig name   length [bp]   elements   length [bp]
NT_001039         1789930         58         44876
NT_001106         1528065         45         28929
NT_001113         2488860         60         28989
NT_001124          190014          6          1327
NT_001128          379449         37          9298
NT_001454        23203091        785        419893
NT_001487          767357         20          9446
NT_001834          290506          6          3953
NT_002319          992829         38         16059
NT_002446          234226          4          2283
NT_002447          406225         12         10003
NT_002448         1397168         41         29396
CH22             33667720       1112        604452
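A coverage figure of this kind can in principle be recomputed from the begin/end positions of individual hits. The sketch below is an assumption about one reasonable way to do it, not a description of how RepeatMasker or the table above counts bases: it merges overlapping hits so that no position is counted twice. The function names are hypothetical.

```python
def covered_bases(intervals):
    """Count bases covered by 1-based inclusive (begin, end) intervals,
    counting positions shared by overlapping intervals only once."""
    total = 0
    cur_begin = cur_end = None
    for begin, end in sorted(intervals):
        if cur_end is None or begin > cur_end:
            # No overlap with the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_begin + 1
            cur_begin, cur_end = begin, end
        else:
            # Overlap: extend the current run if this interval reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_begin + 1
    return total


def coverage_percent(intervals, sequence_length):
    """Percentage of the sequence covered by the given intervals."""
    return 100.0 * covered_bases(intervals) / sequence_length


# Two overlapping MER4B hits from the ".out" example (NT_002446):
# 2808-2896 and 2889-2943 share positions 2889-2896.
hits = [(2808, 2896), (2889, 2943)]
print(covered_bases(hits))                      # merged length
print(sum(end - begin + 1 for begin, end in hits))  # naive sum of hit lengths
```

For these two hits the merged length is 136 bp, while simply adding the hit lengths gives 144 bp, so the choice of method can shift the totals slightly.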

Example of an ".out" file

   SW  perc perc perc  query      position in query         matching          repeat                position in  repeat
score  div. del. ins.  sequence    begin    end   (left)    repeat            class/family           begin  end (left)

 1828  17.8  1.9  1.9  NT_002446     880   1418 (232808) C  MER4D             LTR/MER4-group         (299)  718    234  
 2214  16.2  4.4  4.4  NT_002446    1429   1860 (232366) +  MER4B             LTR/MER4-group             1  445  (166)  
 2096  12.0  0.3  0.3  NT_002446    1884   2167 (232059) +  AluSx             SINE/Alu                   1  285   (27)  
  288   0.0  0.0  0.0  NT_002446    2168   2199 (232027) +  (CAAA)n           Simple_repeat              2   33    (0)  
 2026  16.2  1.7  1.7  NT_002446    2261   2612 (231614) C  MER4B             LTR/MER4-group          (73)  538    186  
 1050  13.9  1.3  1.3  NT_002446    2647   2804 (231422) C  MER4A             LTR/MER4-group           (7)  653    494  
  414  13.5  2.2  2.2  NT_002446    2808   2896 (231330) +  MER4B             LTR/MER4-group             1   86  (525)  
  226  12.7  5.5  5.5  NT_002446    2889   2943 (231283) +  MER4B             LTR/MER4-group           298  351  (260) *
 2222  17.6  6.0  6.0  NT_002446    7806   8191 (226035) +  MSTA              LTR/MaLR                  19  426    (0)  
 1153   9.6  0.0  0.0  NT_002446    8192   8357 (225869) +  MSTA-internal     LTR/MaLR                   1  166 (1485) *
 7428  19.2  1.8  1.8  NT_002446    8352   9682 (224544) +  MSTA-internal     LTR/MaLR                 289 1636   (15)  
 1813  20.8  6.9  6.9  NT_002446    9700  10107 (224119) +  MSTA              LTR/MaLR                   1  426    (0)  
 1919  16.8  2.7  2.7  NT_002446   10117  10486 (223740) +  MSTB              LTR/MaLR                   2  370   (56)  
  657   6.9  0.0  0.0  NT_002446   10494  10580 (223646) +  (TTTC)n           Simple_repeat              2   88    (0)  
 1765  14.8  0.7  0.7  NT_002446   10584  10874 (223352) C  AluSg             SINE/Alu                 (1)  309     23  
  325  15.9  1.6  1.6  NT_002446   10875  10937 (223289) +  MSTB              LTR/MaLR                 359  421    (5)  
  625  23.0  1.7  1.7  NT_002446   11567  11750 (222476) +  MER4D             LTR/MER4-group           236  407  (390) *
 1026  25.4  0.6  0.6  NT_002446   11749  12078 (222148) +  MER4D             LTR/MER4-group           672  981   (36)  
   29   0.0  0.0  0.0  NT_002446   12349  12377 (221849) +  AT_rich           Low_complexity             1   29    (0)  
   21   3.6  0.0  0.0  NT_002446   12525  12552 (221674) +  AT_rich           Low_complexity             1   28    (0)  
 1554  17.1  3.5  3.5  NT_002446   12951  13236 (220990) +  AluJo             SINE/Alu                   1  292   (20)  
 2057  15.2  6.5  6.5  NT_002446   14136  14517 (219709) +  MSTA              LTR/MaLR                  19  414   (12)  
 2104  22.6  8.7  8.7  NT_002446   14518  15158 (219068) +  MSTA-internal     LTR/MaLR                   4  667  (913)  
   26   0.0  0.0  0.0  NT_002446   15168  15193 (219033) +  AT_rich           Low_complexity             1   26    (0)  
 2176  11.8  0.0  0.0  NT_002446   15198  15484 (218742) C  AluY              SINE/Alu                (22)  289      3  
 3575   7.3  1.2  1.2  NT_002446   15493  15986 (218240) C  LTR15             LTR/Retroviral           (6)  487      2  
 2347  19.4  1.0  1.0  NT_002446   15990  16866 (217360) +  MSTA-internal     LTR/MaLR                 725 1651    (0)  
 2292  18.0  3.4  3.4  NT_002446   16867  17283 (216943) +  MSTA              LTR/MaLR                   1  426    (0)  
 2333  18.5  2.4  2.4  NT_002446   17528  17991 (216235) C  MER65C            LTR/MER4-group           (0)  461      1  
  337  19.2  7.7  7.7  NT_002446   17992  18121 (216105) C  MER65-internal    LTR/MER4-group           (0) 4871   4746  
  646  25.2  5.1  5.1  NT_002446   18162  18375 (215851) C  MER65-internal    LTR/MER4-group        (2776) 2095   1875  
 3754  20.8  0.9  0.9  NT_002446   18379  19163 (215063) +  L1PA15-16         LINE/L1                 -706   57 (6089)  
  987  28.5  5.0  5.0  NT_002446   19665  20323 (213903) +  L1PA13            LINE/L1                  852 1508 (4638)  
 1879  16.8  0.0  0.0  NT_002446   20324  20614 (213612) C  AluSx             SINE/Alu                (21)  291      1  
...

Example of a ".tbl" file

==================================================
file name: NT_001039
sequences:          1
total length: 1789930 bp
GC level:       50.94 %
bases masked   688369 bp ( 38.46 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            1234      313498 bp    17.51 %
      ALUs        1110      298568 bp    16.68 %
      MIRs         124       14930 bp     0.83 %

LINEs:             414      224295 bp    12.53 %
      LINE1        299      194922 bp    10.89 %
      LINE2        104       26258 bp     1.47 %

LTR elements:      172       88611 bp     4.95 %
      MaLRs         74       22822 bp     1.28 %
      Retrov.       58       44876 bp     2.51 %
      MER4_group    30       17601 bp     0.98 %

DNA elements:       94       28524 bp     1.59 %
      MER1_type     56       11015 bp     0.62 %
      MER2_type     32       17047 bp     0.95 %
      Mariners       0           0 bp     0.00 %

Unclassified:        2         392 bp     0.02 %

Total interspersed repeats: 655320 bp    36.61 %


Small RNA:           3         514 bp     0.03 %

Satellites:          0           0 bp     0.00 %
Simple repeats:    288       27682 bp     1.55 %
Low complexity:    114        5802 bp     0.32 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element

The sequence(s) were assumed to be of primate origin.
RepeatMasker version  04/21/99               default
ProcessRepeats version  04/21/99


Last modified: 2001/10/05 11:21:07, site: herv.img.cas.cz