Brian Haas <bhaas@broad.mit.edu>

PASA_logo

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Table of Contents

Introduction

PASA was originally developed at The Institute for Genomic Research in 2002 as an effort to automatically improve gene structures in Arabidopsis thaliana. Since then, it has been applied to numerous Eukaryotic genome annotation projects including Rice, Aspergillus species, Plasmodium falciparum, Schistosoma mansoni, and Aedes aegypti. We have also successfully applied it to lesser extents in both Mouse and Human among others.

Functions of PASA include:

PASA is composed of a pipeline of utilities that perform the following ordered set of tasks:

System Overview

PASA runs on a UNIX/LINUX-based architecture. PASA involves components written in Perl and C++. Utilities used by PASA, including GMAP, are wrapped by Perl code. Results are stored primarily within a MySQL database, and are available for analysis using the companion suite of Web-based tools and command-line utilities. Running PASA to generate alignment assemblies requires only two inputs: a multi-fasta file for the genome and a multi-fasta file for the transcripts; if FL-cDNAs are available, then an additional file containing listing the accessions of those FL-cDNAs can be provided as a third input. In order to compare the assemblies to existing gene structure annotations, preexisting gene structure annotations can be provided in GFF3 format, or imported by a user-customized data adapter (described below).

Sample data and a preconfigured complete PASA pipeline are available for demonstration purposes, all included in the software distribution.

Obtaining PASA

Download the latest version of the PASA software straight from Sourceforge

or (preferred) pull the bleeding-edge code using CVS like so:

% cvs -d :pserver:anonymous@pasa.cvs.sourceforge.net:/cvsroot/pasa co PASA

Software Installation Instructions

Prerequisite Software Components

In addition to the PASA software obtained here, you will need the following:

Note
The utilities provided by each software package above should be available via your PATH setting.

Unravelling the PASA distribution

Move the PASA distribution to a location on your filesystem that we can call PASAHOME, such as /usr/local/bin/PASA. From henceforth, we'll refer to this location as $PASAHOME.

The PASA distribution includes the following utilities that you should build and centrally install:

Configuring the PASA Pipeline

After installing each of the software tools above, all that is needed before running PASA is to configure it. The PASA configuration relies on the file: $PASAHOME/pasa_conf/conf.txt

A template configuration file is provided at
$PASAHOME/pasa_conf/pasa.CONFIG.template
Simply copy pasa.CONFIG.template to conf.txt and set the values for your MySQL database settings.  You only need concern yourself with the following values:
PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)

Setting Up the PASA Web Portal (optional, but highly recommended)

Recursively copy (cp -r) the $PASAHOME area to the cgi-bin directory of your webserver. Change permissions on everything so that it is world executable (ie. % chmod -R 755 ./PASA ) Now, visit the URL for the status report page for the pasa database you created during the pasa run above.

http://yourServerName/cgi-bin/PASA/cgi-bin/status_report.cgi?db=$mysqldb

This will provide some summary statistics and links to additional web-based utilities for navigating the results from your pasa run.

Now that you have a URL for your base PASA url, update your original configuration file at: $PASAHOME/pasa_conf/conf.txt to set the value of BASE_PASA_URL=http://yourServerName/cgi-bin/PASA/cgi-bin/

For more info, visit the Tour of the PASA web portal

Running the Alignment Assembly Pipeline

Step A: cleaning the transcript sequences

Have each of these files in the same working directory. Then, run the seqclean utility on you transcripts like so:

% seqclean  transcripts.fasta

If you have a database of vector sequences (ie. UniVec), you can screen for vector as part of the cleaning process by running the following instead:

% seqclean  transcripts.fasta -v /path/to/your/vectors.fasta

This will generate several output files including transcripts.fasta.cln and transcripts.fasta.clean Both of these can be used as inputs to PASA.

Step B: Walking Thru A Complete Example Using the Provided Sample Data

For my PASA primary configuration file, I'm using $PASAHOME/pasa_conf/sample_test.conf symlinked to conf.txt in that same directory. In it, the mysql server and user/password information our set. This configuration file need only be established once per PASA software installation. You should configure your conf.txt file immediately if you haven't done so already.

Sample inputs are provided in the $PASAHOME/sample_data directory. We'll use these inputs to demonstrate the breadth of the software application, including using sample DATA ADAPTERs to import existing gene annotations into the database, and tentative structural updates out.

The PASA pipeline requires separate configuration files for the alignment assembly and later annotation comparison steps, and these are configured separately for each run of the PASA pipeline, setting parameters to be used by the various tools and processes executed within the PASA pipeline. Configuration file templates are provided as $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt and $PASAHOME/pasa_conf/pasa.annotationCompare.Template.txt, and these will be further described when used below.

The next steps explain the current contents of the sample_data directory. You need not redo these operations:

The following steps, you must execute in order to demonstrate the software.

Transcript alignments followed by alignment assembly

The output from this step includes several output files that are written:

A couple of additional files that are sometimes useful include the following: (here is where you go when you want to find out why certain alignments didn't make it into PASA assemblies or why they failed validation):

Annotation Comparisons and Annotation Updates

Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models

The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and models for alternative splicing based on the PASA alignment assemblies generated above.

Loading your preexisting protein-coding gene annotations

Comparing to and updating existing gene structure annotations requires that we import these annotations into the PASA database, and are able to extract the suggested updates. PASA utlizes annotation data adapters to achieve this. GFF3 data adapters are included in the PASA distribution, but you can write your own, and directly tie the PASA pipeline to your own informatics infrastructure (ie. other relational database). If you'd prefer to not use GFF3 and to write your own data adapters, visit the PASA data adapter cookbook.

A sample gff3-formatted annotation file is provided in our sample_data directory as orig_annotations_sample.gff3 and can be loaded like so:

% ../scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3

Before loading your own GFF3-formatted annotation files, be sure to check them for PASA compatibility like so:

% ../misc_utilities/pasa_gff3_validator.pl orig_annotations_sample.gff3

The above gff3-validator will report any entries in your gff3 file that it does not recognize, understand, or otherwise parse properly. It's not a general purpose gff3-validator since it cares only about your protein-coding genes. (note that you should only feed protein-coding genes to PASA using the loader above).

Performing an annotation comparison and generating an updated gene set

Now that the original annotations are loaded, we can perform a comparison of the PASA alignment assemblies to these preexisting gene annotations, to identify cases where updates can be automatically performed to gene structures in order to incorporate the transcript alignments.

I've copied the ../pasa_conf/pasa.annotationCompare.Template.txt file to our working directory as annotCompare.config. Then, I replaced the MYSQLDB=<MYSQLDB> line with MYSQLDB=pasa_sample_db as before with the alignAssembly.config file. Notice this config file contains numerous parameters that can be modified to tune the process to any genome of interest. We'll leave these values untouched for now, relying on the defaults used by PASA, and we'll revisit parameterization later. For most purposes, the defaults are well suited. Run the annotation comparison like so:

% ../scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean

Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA, and those that remained untouched. This file will be named output.gene_structures_post_PASA_updates.$pid.gff3, where $pid is the process ID for this annotation comparison computation.

You should revisit the status_report.cgi web page as described above under Setting Up the PASA Web Portal. There, you will be able to navigate the results of the comparison and examine the classifications for annotation updates assigned to each pasa alignment assembly.

Note
It usually requires at least two cycles of annotation loading, annotation comparison, and annotation updates in order to maximize the incorporation of transcript alignments into gene structures. Updates made to gene structures in the first round often lead to the capacity to incorporate additional transcript alignments that did not fit well in the context of the earlier gene structures. You can use the PASA-updated annotations in the GFF3 file created at the end of the annotation comparison step as input for a subsequent annotation comparison round. All of the results from the separate annotation comparison rounds remain accessible via the PASA web portal (see below).

Tour of the PASA web portal

The results from running PASA on our sample data set can be examined via the PASA web portal. For example purposes, I've saved a few of the reports generated by the PASA web displays (note pages are generated on-the-fly, however these are provided as static only for example purposes).

The above are just a few examples. Install the PASA portal and navigate your PASA results.

Polyadenylation Sites Mapped to the Genome

If seqclean was used to clean the transcript sequences, and both the cleaned and original transcript databases were provided in the alignment assembly run of the PASA pipeline as described, then the polyadenylation sites as evidenced in the original transcript sequences and identified as part of the seqclean process were mapped to the genome. The termini of the polyadenylated transcripts are compared to the genome, and those transcripts that truly appear to be polyadenylated and not resulting from an artifact of internal priming to an A-rich region, are reported as candidate polyA sites. The genome coordinate reported as the polyA site is the nucleotide to which polyA is added, so it corresponds to the last non-polyA nucleotide of the polyadenylated transcript. An example of a candidate polyA site can be extracted from one of the output files (default output.polyAsite_analysis.out) like so:

// cdna:gi|51968615|dbj|AK175237.1|, annotdb_asmbl_id:68712, polyAcoord:50443, transcribedOrient:+, rend
CGCTTCTTATattacagggt
CGCTTCTTATAAAAAAAAAA       gi|51968615|dbj|AK175237.1|  TransOrient (+)
trimmedSeq:
          AAAAAAAAAA
OK polyA site candidate.

An additional fasta file (default output.polyAsites.fasta) summarizes all mapped polyA sites supported by the transcripts. A 100 bp segment of the genome sequence is extracted and oriented, and the last nucleotide in uppercase corresponds to the residue to which polyA is added in the processed transcript. The site corresponding to our example above is as follows:

>68712-50443_+ 1 transcripts: gi|51968615|dbj|AK175237.1|
ATCGACCACCCTCTTTTTTATAAGTAACTTTTCAAGATAACGCTTCTTATattacagggtctacttccattacaaatgcaataggtttgatggttaataa

The accession is bundled like so:

genome_accession - polyA_coordinate _ transcribed_orientation

The rest of the header indicates the number of transcripts supporting this polyA site followed by the list of those transcript accessions. The examples above were extracted from our sample data set provided. A more compelling example for Arabidopsis, using spliced transcripts only, is as follows:

>chr5-506542_- 44 transcripts: gi|86086725|gb|DR382484.1|DR382484,gi|86082384|gb|DR378143.1|DR378143,gi|86082270|gb|DR378029.1|DR378029,gi|86082193|gb|DR377952.1|DR37795
2,gi|86082172|gb|DR377931.1|DR377931,gi|86082156|gb|DR377915.1|DR377915,gi|86082123|gb|DR377882.1|DR377882,gi|86082071|gb|DR377830.1|DR377830,gi|86081971|gb|DR377730.
1|DR377730,gi|86081887|gb|DR377646.1|DR377646,gi|86081885|gb|DR377644.1|DR377644,gi|86081868|gb|DR377627.1|DR377627,gi|86081709|gb|DR377466.1|DR377466,gi|86081657|gb|
DR377414.1|DR377414,gi|86081635|gb|DR377392.1|DR377392,gi|86081559|gb|DR377316.1|DR377316,gi|86081550|gb|DR377307.1|DR377307,gi|86081543|gb|DR377300.1|DR377300,gi|860
81529|gb|DR377286.1|DR377286,gi|86081252|gb|DR377009.1|DR377009,gi|86081247|gb|DR377004.1|DR377004,gi|86081239|gb|DR376996.1|DR376996,gi|86079014|gb|DR374771.1|DR3747
71,gi|86076986|gb|DR372743.1|DR372743,gi|85870703|gb|DR191655.1|DR191655,gi|85869935|gb|DR190887.1|DR190887,gi|85869920|gb|DR190872.1|DR190872,gi|85869608|gb|DR190560
.1|DR190560,gi|85869452|gb|DR190404.1|DR190404,gi|85869353|gb|DR190305.1|DR190305,gi|85869352|gb|DR190304.1|DR190304,gi|85869340|gb|DR190292.1|DR190292,gi|85869337|gb
|DR190289.1|DR190289,gi|85869336|gb|DR190288.1|DR190288,gi|85869335|gb|DR190287.1|DR190287,gi|85869329|gb|DR190281.1|DR190281,gi|85868471|gb|DR189423.1|DR189423,gi|85
867798|gb|DR188750.1|DR188750,gi|85867058|gb|DR188010.1|DR188010,gi|49285508|gb|BP634256.1|BP634256,gi|32888810|gb|CB264037.1|CB264037,gi|32888295|gb|CB263522.1|CB263
522,gi|32885705|gb|CB260932.1|CB260932,gi|32885650|gb|CB260877.1|CB260877
GTTTTATCTTTGTGACTTTATTAATCCTAAGACTATTATGGGTTTGTATTaaagtttgcttctttcttgctcactacacaattaagattcaagcccattg
Note
Polyadenylation sites identified here require that there is evidence of polyadenylation in the original transcript sequence. Other systems examine clusters of transcript alignment termini within windows. This is not done here yet as part of PASA. Only those polyA sites supported by experimental evidence of polyadenylation are reported.

Identification and Classification of All Alternative Splicing Variations

PASA is a tool well suited to the identification and classification of alternative splicing isoforms as evidenced by incompatible transcript alignments. Overlapping alignments found incompatible in that they have some structural difference within their overlapping region, and due to their nature of incompatibility, they are relegated to different but overlapping alignment assemblies. PASA performs and all-vs-all comparison among the clustered overlapping alignment assemblies to identify the following categories of splicing variations:

The automated alternative splicing analysis can be run like so as exemplified from the sample_data directory:

% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -g genome_sample.fasta -t all_transcripts.fasta.clean --ALT_SPLICE

The results are available in the default output files: output.alt_splice_label_combinations.dat:: a tab-delimited listing that contains all unique splicing labels for each pasa alignment assembly labeled with a variation. For example:

genome  pasa_acc  assembly_cluster   combinations_of_labels
68711   asmbl_2 1       ends_in_intron
68711   asmbl_6 3       alt_donor
68711   asmbl_4 3       alt_donor
68711   asmbl_10        6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_11        6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_9 6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_24        14      spliced_intron, starts_in_intron
68711   asmbl_23        14      retained_intron
...
output.indiv_splice_labels_and_coords.dat

provides the genome coordinates for each alternative splicing label applied to each corresponding pasa alignment assembly. For example:

genome_acc  pasa_acc  assembly_cluster altsplice_label  genome_lend genome_rend transcribed_orient list_of_cdnas_supporting_variation
68711   asmbl_10        6       alt_acceptor    35633   35634   -       gi|42468094|emb|BX819464.1|CNS0A8YA
68711   asmbl_11        6       alt_acceptor    35639   35640   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       retained_exon   35448   35498   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42528978|gb|BX835128.1|BX835128
68711   asmbl_11        6       skipped_exon    35448   35498   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       retained_exon   36174   36227   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_11        6       skipped_exon    36174   36227   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_11        6       retained_exon   36268   36309   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       skipped_exon    36268   36309   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_11        6       retained_exon   36879   37028   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       skipped_exon    36879   37028   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_10        6       alt_acceptor    35633   35634   -       gi|42468094|emb|BX819464.1|CNS0A8YA
68711   asmbl_9 6       alt_acceptor    35639   35640   -       gi|11125656|emb|AJ294534.1|ATH294534,gi|13398925|emb|AJ276619.1|ATH276619
...

The PASA web portal provides numerous reports, graphs, and illustrations to navigate the results of the automated alternative splicing analysis.

Only Interested in Alignment Assembly?

In our current working directory, there's a file clusters_of_valid_alignments.txt that contains all the clusters of valid alignments in a simple text format like so:

// cluster: number
accession,transcribed_orientation,lend-rend,lend-rend,...
...

The transcribed orientation is +,-, or ?. The ? orientation should be used only for single-exon transcript alignments for which the orientation of transcription is ambiguous. By default, PASA assigns all single-exon transcripts that lack evidence of polyadenylation to the ambiguous transcribed orientation. Given this input file, we can demonstrate the pasa alignment assembler like so:

% ../scripts/pasa_alignment_assembler_textprocessor.pl < clusters_of_valid_alignments.txt

Each cluster of transcript alignments is assembled separately and the results are outputted to stdout with illustrations.

Example input

// cluster: 52
gi|14532493|gb|AY039871.1|,-,38468-38715,38808-39953
gi|14532527|gb|AY039888.1|,-,38468-38715,38808-39953
gi|18655376|gb|AY077666.1|,-,38846-39847
gi|19801675|gb|AV782885.1|AV782885,-,38468-38715,38808-39255
gi|19839856|gb|AV805871.1|AV805871,-,38478-38715,38808-38972
gi|19861773|gb|AV819822.1|AV819822,-,38496-38715,38808-39021
gi|19864228|gb|AV822195.1|AV822195,?,39309-39953
gi|21403701|gb|AY084991.1|,-,38331-38715,38912-39950
gi|32362537|gb|CB074156.1|CB074156,?,38866-39212
gi|42467384|emb|BX819813.1|CNS0A8I9,-,38509-38715,38808-39898
gi|42467462|emb|BX820042.1|CNS0A8GI,-,38481-38715,38808-39873
gi|42467544|emb|BX820309.1|CNS0A8LV,-,38509-38715,38808-39907
gi|42467850|emb|BX818822.1|CNS0A905,-,38506-38715,38808-39907
gi|42468073|emb|BX819411.1|CNS0A8VM,-,38495-38715,38912-39907
gi|42468257|emb|BX820772.1|CNS0A8PI,-,38434-38715,38808-39907
gi|49289224|gb|BP637972.1|BP637972,-,38427-38715,38808-38892
gi|56086876|gb|BP562044.2|BP562044,?,39467-39919
gi|58799838|gb|BP779059.1|BP779059,-,38468-38715,38912-39063
gi|59847772|gb|BP811693.1|BP811693,?,39525-39918
gi|59898821|gb|BP837850.1|BP837850,?,39540-39918
gi|86056909|gb|DR352666.1|DR352666,?,39578-39950
gi|86056910|gb|DR352667.1|DR352667,?,39681-39894
gi|86056911|gb|DR352668.1|DR352668,?,39496-39950
gi|86056912|gb|DR352669.1|DR352669,?,39454-39907
gi|86056913|gb|DR352670.1|DR352670,?,39507-39950
gi|86056914|gb|DR352671.1|DR352671,?,39437-39919
gi|86084686|gb|DR380445.1|DR380445,-,38331-38715,38912-39127
gi|8678774|gb|AV519247.1|AV519247,-,38401-38715,38808-38918
gi|8682044|gb|AV522517.1|AV522517,-,38486-38715,38912-39124
gi|8700432|gb|AV538676.1|AV538676,-,38506-38715,38912-39282

Corresponding Output

Individual Alignments: (30)
  0 -------------->      <---------------------------------------       (a+/s-)gi|21403701|gb|AY084991.1|
  1 -------------->      <--------      (a+/s-)gi|86084686|gb|DR380445.1|DR380445
  2    ----------->   <----     (a+/s-)gi|8678774|gb|AV519247.1|AV519247
  3     ---------->   <---      (a+/s-)gi|49289224|gb|BP637972.1|BP637972
  4     ---------->   <---------------------------------------- (a+/s-)gi|42468257|emb|BX820772.1|CNS0A8PI
  5      --------->   <------------------------------------------       (a+/s-)gi|14532493|gb|AY039871.1|
  6      --------->   <------------------------------------------       (a+/s-)gi|14532527|gb|AY039888.1|
  7      --------->   <---------------- (a+/s-)gi|19801675|gb|AV782885.1|AV782885
  8      --------->      <------        (a+/s-)gi|58799838|gb|BP779059.1|BP779059
  9      --------->   <------   (a+/s-)gi|19839856|gb|AV805871.1|AV805871
 10       -------->   <---------------------------------------  (a+/s-)gi|42467462|emb|BX820042.1|CNS0A8GI
 11       -------->      <--------      (a+/s-)gi|8682044|gb|AV522517.1|AV522517
 12       -------->      <------------------------------------- (a+/s-)gi|42468073|emb|BX819411.1|CNS0A8VM
 13       -------->   <-------- (a+/s-)gi|19861773|gb|AV819822.1|AV819822
 14       -------->   <---------------------------------------- (a+/s-)gi|42467850|emb|BX818822.1|CNS0A905
 15       -------->      <--------------        (a+/s-)gi|8700432|gb|AV538676.1|AV538676
 16        ------->   <---------------------------------------- (a+/s-)gi|42467384|emb|BX819813.1|CNS0A8I9
 17        ------->   <---------------------------------------- (a+/s-)gi|42467544|emb|BX820309.1|CNS0A8LV
 18                    --------------------------------------   (a+/s-)gi|18655376|gb|AY077666.1|
 19                     --------------  (a+/s?)gi|32362537|gb|CB074156.1|CB074156
 20                                     -------------------------       (a+/s?)gi|19864228|gb|AV822195.1|AV822195
 21                                          -------------------        (a+/s?)gi|86056914|gb|DR352671.1|DR352671
 22                                           ----------------- (a+/s?)gi|86056912|gb|DR352669.1|DR352669
 23                                           ------------------        (a+/s?)gi|56086876|gb|BP562044.2|BP562044
 24                                            ------------------       (a+/s?)gi|86056911|gb|DR352668.1|DR352668
 25                                             -----------------       (a+/s?)gi|86056913|gb|DR352670.1|DR352670
 26                                             ----------------        (a+/s?)gi|59847772|gb|BP811693.1|BP811693
 27                                              ---------------        (a+/s?)gi|59898821|gb|BP837850.1|BP837850
 28                                               ---------------       (a+/s?)gi|86056909|gb|DR352666.1|DR352666
 29                                                   --------- (a+/s?)gi|86056910|gb|DR352667.1|DR352667


ASSEMBLIES: (2)
       ----------->   <------------------------------------------       (a-/s-)gi|8678774|gb|AV519247.1|AV519247/gi|49289224|gb|BP637972.1|BP637972/gi|42468257|emb|BX820772.1|CNS0A8PI/gi|14532493|gb|AY039871.1|/gi|14532527|gb|AY039888.1|/gi|19801675|gb|AV782885.1|AV782885/gi|19839856|gb|AV805871.1|AV805871/gi|42467462|emb|BX820042.1|CNS0A8GI/gi|19861773|gb|AV819822.1|AV819822/gi|42467850|emb|BX818822.1|CNS0A905/gi|42467384|emb|BX819813.1|CNS0A8I9/gi|42467544|emb|BX820309.1|CNS0A8LV/gi|18655376|gb|AY077666.1|/gi|32362537|gb|CB074156.1|CB074156/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667
    -------------->      <---------------------------------------       (a-/s-)gi|21403701|gb|AY084991.1|/gi|86084686|gb|DR380445.1|DR380445/gi|58799838|gb|BP779059.1|BP779059/gi|8682044|gb|AV522517.1|AV522517/gi|42468073|emb|BX819411.1|CNS0A8VM/gi|8700432|gb|AV538676.1|AV538676/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667



Assembly(1): orient(a-/s-) align: 38401(1461)-38715(1147)>YY....XX<38808(1146)-39953(1)
Assembly(2): orient(a-/s-) align: 38331(1427)-38715(1043)>YY....XX<38912(1042)-39953(1)

Train ab-initio gene finding software using PASA long-ORFs

The PASA alignment assemblies can be used to automatically extract gene structures for protein coding genes to be used for training gene finding software. To do this, the longest ORF is found within each PASA alignment assembly, allowing for 5' and 3' partials. Those long ORFs encoding at least 300 amino acids (configurable) are reported. After running the PASA to assemble all transcript alignments as described above, you can run the following:

% $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi -M "sample_db_pasa:localhost" -p "access:access" -g genome_sample.fasta

for example, from our sample_data directory, we would run

% $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi -M "sample_db_pasa:localhost" -p "access:access" -g genome_sample.fasta

which generates two files: trainingSetCandidates.fasta and trainingSetCandidates.gff3 The fasta file contains the protein sequences corresponding to the longest ORF found in each relevant PASA alignment assembly. The gff3 file provides the corresponding gene structure in the genome. The header of each entry in the fasta file provides the following, by example:

% grep '>' trainingSetCandidates.fasta
>asmbl_1 contig:68711 coords:5453-8666 num_segs:3 type:complete len:244
>asmbl_10 contig:68711 coords:37495-33357 num_segs:13 type:complete len:483
>asmbl_11 contig:68711 coords:37495-33357 num_segs:13 type:complete len:514
>asmbl_12 contig:68711 coords:31440-33875 num_segs:9 type:complete len:258
>asmbl_13 contig:68711 coords:39953-38401 num_segs:2 type:complete len:333
>asmbl_14 contig:68711 coords:39953-38331 num_segs:2 type:complete len:318
>asmbl_26 contig:68711 coords:72555-74173 num_segs:1 type:5prime_partial len:456
>asmbl_27 contig:68711 coords:81430-76216 num_segs:16 type:5prime_partial len:1020
...

The accession corresponds to the PASA assembly. The contig and the coordinate span of the alignment assembly is provided, followed by the number of exons, and the length of the longest ORF. The type indicator can be any of the following: complete, 5prime_partial, 3prime_partial, or internal. The 5prime_partial are missing a start codon and translate to the very 5' end, 3prime_partial are missing a stop codon and translate to their very 3' end, and internal translate from the first to the last basepair in the sequence, missing a start and a stop codon. The complete category are of greatest interest for the prospects of genefinder training. Typically, we would search these complete proteins against the nonredundant protein database at GenBank and identify those ORFs that have good database matches across most of their length. Such entries can be confidently used for training, in addition to those particularly long ORFs that do not match known proteins and are sufficiently complex in sequence.

Reference

The PASA software and its original application are described in:

The use of PASA to analyze polyadenylation signals is described in:

Enhancements to PASA that automate the identifiation and classification of alternative splicing variations are described here:

Software Support

The PASA pipeline was an important component of the eukaryotic genome annotation at The Institute for Genomic Research. Since that time, PASA has been successfully used by genome sequencing and annotation groups across the globe. I (Brian Haas) developed and actively maintain the software. Please contact me directly at bhaas@broad.mit.edu if you should have any questions or require assistance.