What is PaPI

PaPI is a Machine Learning ensemble method to score human DNA mutations. It scores the functional effect of coding single nucleotide variants (SNVs) as well as coding deletions, insertions and indels (DIVs).

How it works

PaPI is based on three independent algorithms: Polyphen2, Sift and our trained pseudo amino acid composition (PseAAC) model.
Two PseAAC models are available: random forest (RF) and logistic regression (LR). Both models encode the wild-type and mutated primary sequences as pseudo amino acid compositions, and also take into account full-length primary sequence attributes and evolutionary conservation scores (GERP++, Phylop and Siphy).


PaPI scores any coding variant: single nucleotide variants (SNVs) as well as deletions, insertions and indels (DIVs).
Each mutation is scored as damaging (disease-related) or benign/neutral (tolerated) by Polyphen2, Sift and the aforementioned RF (or LR) model. A voting scheme is applied to the three scores to produce the PaPI (ensemble) score. The PaPI score reflects the probability of belonging to the damaging class and therefore lies in the [0, 1] range. If the PaPI score is 0.5 or more, the mutation is labelled as damaging, otherwise as benign.
Sift and Polyphen2 cannot score every mutation: they handle SNVs only and, in rare cases, even an SNV may be left without a score, e.g. because of wrong annotations or lack of data. The RF and LR models, on the other hand, are always capable of scoring variants. Therefore, when Sift and Polyphen2 are unable to score a variant, the PaPI score is computed from the RF (or LR) model prediction alone.
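
The thresholding and fallback rules described above can be sketched in Python as follows. This is only an illustration: the actual voting scheme is not specified here, so the mean_vote combiner is a hypothetical placeholder, and the function names, the assumption that all three inputs are expressed on a common "probability of damaging" scale, and the handling of the case where only one of Sift/Polyphen2 is missing are all assumptions.

```python
DAMAGING_THRESHOLD = 0.5  # from the text: a score of 0.5 or more is damaging


def papi_label(score):
    """Map a PaPI score in [0, 1] to a class label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return "DAMAGING" if score >= DAMAGING_THRESHOLD else "TOLERATED"


def mean_vote(scores):
    """Hypothetical combiner: a plain average, NOT PaPI's real voting scheme."""
    return sum(scores) / len(scores)


def ensemble_score(pseaac, polyphen=None, sift=None, combine=mean_vote):
    """Combine the three scores, falling back to the PseAAC model alone.

    None means the tool could not score the variant. All inputs are
    assumed to be damaging probabilities on the same [0, 1] scale.
    """
    if polyphen is None and sift is None:
        # Documented fallback: only the RF (or LR) prediction is used.
        return pseaac
    # Assumption: when only one tool is missing, combine what is available.
    available = [s for s in (pseaac, polyphen, sift) if s is not None]
    return combine(available)
```

For a deletion, which Sift and Polyphen2 cannot score, ensemble_score(0.8) simply returns 0.8 and papi_label(0.8) gives DAMAGING.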


The PaPI web server offers two analysis modes: single and bulk. Single mode predicts one variant at a time. Bulk mode accepts an uploaded file of mutations (one per line, see "File format example" below). At the moment, uploaded files are limited to a maximum of 1000 mutations and a file size of 2MB or less. We plan to lift this limit in the near future.
In either submission mode it is possible to choose between two trained classifier models: RF and LR. The RF model is the more accurate of the two, but is slower than LR. We recommend the RF model for the best accuracy. When the RF model is chosen, results are sent to the user by email, even for a single variant.
PaPI currently works only with the human reference genome hg19/GRCh37. It is possible to choose between three different gene models (RefSeq, Ensembl and GENCODE) in order to annotate the coding variants.

Mutation format

PaPI accepts as input single nucleotide variants, deletions, insertions and generic indels. Multi-nucleotide variants are treated as indels. Genomic coordinates are all 1-based (as in the UCSC genome browser) and nucleotide sequences are assumed to be on the positive strand (+).
In theory, PaPI can score DIVs of any length. However, the classifier model explicitly encodes sequences as 20-amino-acid snippets on both sides of the mutation in the primary sequence (for longer variations, the parts exceeding the snippets are still implicitly encoded in other features). For this reason the PaPI web server only accepts variants shorter than 20 amino acids (60 nucleotides).
The format for each kind of variant is reported below. Note that in Bulk mode the file must have one mutation per line and no header. An example mutation input file is also reported.

SNVs format example

chr16 72055100 C T
Chromosome Start Stop Reference Alteration
16 72055100 72055100 C T

Deletion format example

chr16 72055099 GCG G
Chromosome Start Stop Reference Alteration
16 72055100 72055101 CG -

Insertion format example

chr16 72055100 C CT
Chromosome Start Stop Reference Alteration
16 72055100 72055101 - T

Indel format example

chr16 72055099 GCGC GAT
Chromosome Start Stop Reference Alteration
16 72055100 72055102 CGC AT
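
The four formats above can be told apart from the Reference and Alteration fields alone. The Python sketch below shows one way to do that and to enforce the 60-nucleotide limit before submitting. The function name is hypothetical, and reading "shorter than 60 nucleotides" as a strict limit on the longer of the two alleles is an assumption, not the server's documented check.

```python
MAX_NUCLEOTIDES = 60  # 20 amino acids, per the limit stated in "Mutation format"
VALID_BASES = set("ACGT")


def classify_variant(ref, alt):
    """Return snv, insertion, deletion or indel for a Reference/Alteration pair."""
    ref, alt = ref.upper(), alt.upper()
    for allele in (ref, alt):
        if not allele or (allele != "-" and not set(allele) <= VALID_BASES):
            raise ValueError(f"unexpected allele: {allele!r}")
    if ref == alt:
        raise ValueError("Reference and Alteration are identical")
    if ref == "-":
        kind = "insertion"
    elif alt == "-":
        kind = "deletion"
    elif len(ref) == 1 and len(alt) == 1:
        kind = "snv"
    else:
        # Covers generic indels and multi-nucleotide variants,
        # which the text says are treated as indels.
        kind = "indel"
    # Assumption: the limit applies to the longer allele and is strict.
    longest = max(len(a) for a in (ref, alt) if a != "-")
    if longest >= MAX_NUCLEOTIDES:
        raise ValueError(f"variant spans {longest} nt; must be shorter than 60")
    return kind
```

With the examples above, classify_variant("C", "T") yields snv, ("CG", "-") deletion, ("-", "T") insertion and ("CGC", "AT") indel.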

File format example (space/tab delimited)

The first 5 fields must be the mutation descriptors as shown in the examples above (Chromosome Start Stop Reference Alteration).

2 127447848 127447848 G T
7 30962205 30962205 C A
7 30951756 30951756 G -
7 30951664 30951664 A G
16 72055100 72055100 C T
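
A file like the one above can be checked locally before upload. The following Python sketch parses the first five fields of each line and applies the bulk-mode limits (1000 mutations, 2MB). The function name is hypothetical; interpreting 2MB as 2 * 1024 * 1024 bytes, skipping blank lines, and requiring Start <= Stop are assumptions rather than documented server behaviour.

```python
MAX_MUTATIONS = 1000          # limit stated for bulk uploads
MAX_BYTES = 2 * 1024 * 1024   # assumption: "2MB" read as 2 MiB


def read_mutation_file(text):
    """Parse bulk-mode input into (chrom, start, stop, ref, alt) tuples."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("file is larger than 2MB")
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption)
        fields = line.split()  # splits on spaces and tabs alike
        if len(fields) < 5:
            raise ValueError(f"line {line_no}: expected at least 5 fields")
        chrom, start, stop, ref, alt = fields[:5]
        try:
            start, stop = int(start), int(stop)
        except ValueError:
            raise ValueError(
                f"line {line_no}: Start and Stop must be integers"
            ) from None
        if start < 1 or stop < start:
            # Coordinates are 1-based; Start <= Stop is an assumption.
            raise ValueError(f"line {line_no}: invalid coordinates")
        records.append((chrom, start, stop, ref, alt))
    if len(records) > MAX_MUTATIONS:
        raise ValueError(f"{len(records)} mutations; the limit is {MAX_MUTATIONS}")
    return records
```

Typical use would be read_mutation_file(open("mutations.txt").read()), where mutations.txt is a placeholder file name.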

Output format

PaPI output is a tab-delimited text file. Fields are explained below.

PAPI_CLASS : PaPI predicted class of mutation (DAMAGING | TOLERATED | NA=not available). Note that a coding variant can be left unpredicted (NA) only when it is a frameshift/stop-causing variant at the first amino acid positions of a protein.
PAPI_SCORE : PaPI score for the variant (range: from 0 to 1). The higher the score, the higher the probability of belonging to the DAMAGING class.
TRANSCRIPT : overlapping mRNA. Non-coding RNAs and ORF genes are not included.
GENE : gene name.
CHR : Mutated chromosome
START : genomic position where mutation begins
END : genomic position where mutation ends
REF : wild-type allele (with respect to hg19/GRCh37)
ALT : alteration
TYPE : type of mutation (snv | insertion | deletion | indel)
CODING_FUNCTION : functional class of variant (synonymous | missense | stop-causing | stop-disrupting | inframe | frameshift)
AA_CHANGE : amino acid change at the mutated protein position. Because only the first amino acid affected by the genomic mutation is reported, for DIVs the wild-type and mutated amino acids may be the same even though the downstream amino acid sequence is changed.
PHYLOP : Phylop Score.
GERP++ : Gerp++ Score
SIPHY : Siphy Score
POLYPHEN_SCORE : Original Polyphen2-HumVar score for the mutation.
POLYPHEN_CLASS : Polyphen2 class (B = benign | D = probably/possibly damaging | NA)
SIFT_SCORE : Original Sift score for the mutation.
SIFT_CLASS : Sift class (B = benign | D = damaging | NA)
PseAA_SCORE : Score of our PseAAC-aware trained classifier (RF or LR)
PseAA_CLASS : Relative class of variation (B = neutral | D = damaging | NA)
MESSAGE : error/warning message, if any.
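
Since the output is tab-delimited, it can be loaded with a standard CSV reader. The Python sketch below assumes the file starts with a header row carrying the field names listed above; that is not confirmed here, so if there is no header the names can be passed explicitly through the fieldnames argument. The function names, and the treatment of NA or empty values in PAPI_SCORE as missing, are assumptions.

```python
import csv


def read_papi_output(lines, fieldnames=None):
    """Yield one dict per variant from PaPI's tab-delimited output.

    lines: any iterable of text lines, e.g. an open file.
    fieldnames: column names to use when the file has no header row.
    """
    reader = csv.DictReader(lines, fieldnames=fieldnames, delimiter="\t")
    for row in reader:
        score = row.get("PAPI_SCORE")
        # Assumption: unscored variants carry NA (or nothing) in this column.
        row["PAPI_SCORE"] = None if score in (None, "", "NA") else float(score)
        yield row


def damaging(rows):
    """Keep only the variants PaPI labelled as DAMAGING."""
    return [row for row in rows if row.get("PAPI_CLASS") == "DAMAGING"]
```

For example, damaging(read_papi_output(open("papi_results.txt"))) would return the damaging variants, where papi_results.txt is a placeholder file name.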