PaPI is a machine-learning ensemble method for scoring human DNA mutations. It scores the functional effect of coding single nucleotide variants (SNVs) as well as coding deletions, insertions and indels (DIVs).
PaPI is based on three independent algorithms: Polyphen2, Sift and our
trained pseudo amino acid composition (PseAAC) model.
Two PseAAC models are available: random forest (RF) and logistic
regression (LR). Both models encode the wild-type and mutated primary
sequences as pseudo amino acid compositions, and also take into account
full-length primary sequence attributes and evolutionary conservation
scores (GERP++, Phylop and Siphy).
PaPI scores any coding variant: single nucleotide variants (SNVs) as
well as deletions, insertions and indels (DIVs).
Each mutation is scored as damaging (disease-related) or benign/neutral
(tolerated) by Polyphen2, Sift and the aforementioned RF (or LR) model.
A voting scheme is applied to the three scores to produce the PaPI
(ensemble) score. The PaPI score reflects the probability of belonging to
the damaging class and therefore lies in the [0, 1] range. If the PaPI
score is 0.5 or more, the mutation is labelled as damaging; otherwise it
is labelled as benign.
Because of their limitations, Sift and Polyphen2 may not be able to score
a mutation: they can score SNVs only and, in rare cases, even some SNVs
are left without a score (e.g. because of wrong annotations or lack of
data). The RF and LR models, on the other hand, are always capable of
scoring variants. Therefore, when Sift and Polyphen2 are unable to score
a variant, the PaPI score is computed from the RF (or LR) model
prediction alone.
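The 0.5 threshold and the fallback rule described above can be sketched as follows. This is only an illustration, not PaPI's actual implementation: the function names are hypothetical, the inputs are assumed to be already expressed as damaging-class probabilities in [0, 1], and since the voting scheme is not specified here, a plain mean stands in for it as a placeholder. Voting over whichever scores are present when only one of Sift/Polyphen2 is missing is also an assumption.

```python
from typing import Callable, Optional, Sequence

DAMAGING_THRESHOLD = 0.5  # from the text: a score of 0.5 or more is labelled damaging


def mean_vote(scores: Sequence[float]) -> float:
    """Placeholder combiner (assumption): NOT PaPI's real voting scheme."""
    return sum(scores) / len(scores)


def papi_score(
    pseaa: float,
    polyphen: Optional[float] = None,
    sift: Optional[float] = None,
    vote: Callable[[Sequence[float]], float] = mean_vote,
) -> float:
    """Combine the available scores, falling back to the PseAAC model alone."""
    if polyphen is None and sift is None:
        # Neither Polyphen2 nor Sift could score the variant (e.g. a DIV).
        return pseaa
    available = [s for s in (pseaa, polyphen, sift) if s is not None]
    return vote(available)


def papi_class(score: float) -> str:
    """Label a PaPI score using the 0.5 threshold."""
    return "DAMAGING" if score >= DAMAGING_THRESHOLD else "TOLERATED"
```

For example, `papi_class(papi_score(0.8))` uses the PseAAC score alone, as would happen for a deletion that Sift and Polyphen2 cannot score.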
The PaPI web server offers two analysis modes: single and bulk. Single
mode predicts one variant at a time. Bulk mode lets you upload a file
containing mutations (one per line, see "File format example" below).
At the moment, uploaded files are limited to a maximum of 1000 mutations
and a file size of 2 MB or less. We plan to lift this limit in the
near future.
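A file can be checked against these limits before uploading. The sketch below is a local convenience only; the function name is hypothetical, blank lines are assumed not to count as mutations, and reading "2 MB" as 2 x 1024 x 1024 bytes is an assumption.

```python
import os

MAX_MUTATIONS = 1000               # limit stated in the text
MAX_FILE_BYTES = 2 * 1024 * 1024   # "2 MB"; the exact byte count is an assumption


def check_bulk_file(path):
    """Return a list of problems; an empty list means the file fits the limits."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes, above the 2 MB limit")
    with open(path, encoding="utf-8") as handle:
        # One mutation per line; blank lines are ignored (assumption).
        n_mutations = sum(1 for line in handle if line.strip())
    if n_mutations > MAX_MUTATIONS:
        problems.append(f"file has {n_mutations} mutations, above the limit of {MAX_MUTATIONS}")
    return problems
```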
For each submission mode it is possible to choose between two trained
classifier models: RF and LR. The RF model is the more accurate, but it
is slower than the LR model. We recommend using the RF model to achieve
the best accuracy. When the RF model is chosen, the results are sent to
the user by email, even for a single variant.
PaPI currently works only with the human reference genome hg19/GRCh37. It
is possible to choose between three different gene models (RefSeq,
Ensembl and GENCODE) in order to annotate the coding variants.
PaPI accepts as input single nucleotide variants, deletions, insertions
and generic indels. Multi-nucleotide variants are treated as indels.
Genomic coordinates are all 1-based (as in the UCSC genome browser) and nucleotide sequences are assumed to be on the positive strand (+).
Theoretically, PaPI can score DIVs of any length. However, the classifier
model explicitly encodes sequences using 20-amino-acid snippets on both
sides of the mutation in the primary sequence. (For longer variations,
the parts exceeding the snippets are still implicitly encoded in other
features.) For this reason the PaPI web server only accepts variants
shorter than 20 amino acids (60 nucleotides).
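The length limit can be checked locally before submission. In this sketch the function names are hypothetical, and measuring a variant by the longer of its reference and alteration alleles (with "-" counting as empty) is an assumption about how the server applies the limit.

```python
MAX_VARIANT_NT = 60  # 20 amino acids x 3 nucleotides per codon


def allele_length(allele):
    """Length of an allele in nucleotides; '-' denotes an empty allele."""
    return 0 if allele == "-" else len(allele)


def within_length_limit(ref, alt):
    """True if the variant is shorter than 60 nucleotides (assumed measure)."""
    return max(allele_length(ref), allele_length(alt)) < MAX_VARIANT_NT
```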
The format for each kind of variant is reported below. Note that in
Bulk mode the file must have one mutation per line and no header. An
example mutation input file is also reported.
(HGVS)
DHODH:c.595C>T
(VCF)
chr16 72055100 C T
(PaPI)
Chromosome Start Stop Reference Alteration
16 72055100 72055100 C T
(HGVS)
DHODH:c.595_596delCG
(VCF)
chr16 72055099 GCG G
(PaPI)
Chromosome Start Stop Reference Alteration
16 72055100 72055101 CG -
(HGVS)
DHODH:c.595_596insT
(VCF)
chr16 72055100 C CT
(PaPI)
Chromosome Start Stop Reference Alteration
16 72055100 72055101 - T
(HGVS)
DHODH:c.595_597delCGCinsAT
(VCF)
chr16 72055099 GCGC GAT
(PaPI)
Chromosome Start Stop Reference Alteration
16 72055100 72055102 CGC AT
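The relation between the VCF and PaPI representations in the four examples above can be sketched as a small converter. The rule is inferred from these examples only and is not an official specification: the function name is hypothetical, and behaviour for cases not shown (such as multi-allelic records) is not covered.

```python
def vcf_to_papi(chrom, pos, ref, alt):
    """Convert one VCF-style variant to (Chromosome, Start, Stop, Reference, Alteration)."""
    # PaPI examples drop the "chr" prefix.
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    # VCF pads insertions/deletions with a shared leading base; remove it.
    if (len(ref) > 1 or len(alt) > 1) and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    if not ref:
        # Insertion: Start and Stop are the two bases flanking the inserted sequence.
        return (chrom, pos - 1, pos, "-", alt)
    # SNV, deletion or indel: Stop is the last reference base affected.
    stop = pos + len(ref) - 1
    return (chrom, pos, stop, ref, alt or "-")
```

Applied to the deletion example, `vcf_to_papi("chr16", 72055099, "GCG", "G")` yields the fields `16 72055100 72055101 CG -`.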
The first five fields must be the mutation descriptors as shown in
the examples above (Chromosome Start Stop Reference Alteration).
2 127447848 127447848 G T
7 30962205 30962205 C A
7 30951756 30951756 G -
7 30951664 30951664 A G
16 72055100 72055100 C T
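Lines in this format can be parsed and classified with a few lines of code. The sketch below assumes whitespace-separated fields and uses hypothetical names; the type labels follow the snv | insertion | deletion | indel vocabulary used in the PaPI output, with multi-nucleotide variants treated as indels as stated earlier.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    chromosome: str
    start: int
    stop: int
    reference: str
    alteration: str


def parse_mutation(line):
    """Parse the first five whitespace-separated fields of an input line."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    chrom, start, stop, ref, alt = fields[:5]
    return Mutation(chrom, int(start), int(stop), ref, alt)


def mutation_type(mutation):
    """Classify a mutation from its Reference and Alteration fields."""
    if mutation.reference == "-":
        return "insertion"
    if mutation.alteration == "-":
        return "deletion"
    if len(mutation.reference) == 1 and len(mutation.alteration) == 1:
        return "snv"
    return "indel"  # includes multi-nucleotide variants
```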
PaPI output is a tab-delimited text file. Fields are explained below.
PAPI_CLASS : PaPI predicted class of the mutation (DAMAGING | TOLERATED | NA = not available). Note that unpredicted coding variants (NA) can occur only for frameshift/stop-causing variants at the first amino acid positions of a protein.
PAPI_SCORE : PaPI score for the variant (range: from 0 to 1). The higher the score, the higher the probability of belonging to the DAMAGING class.
TRANSCRIPT : overlapping mRNA. Non coding RNA and ORF genes are not included.
GENE : gene name.
CHR : Mutated chromosome
START : genomic position where mutation begins
END : genomic position where mutation ends
REF : wild type (referring to hg19/GRCh37)
ALT : alteration
TYPE : type of mutation (snv | insertion | deletion | indel)
CODING_FUNCTION : functional class of variant (synonymous | missense | stop-causing | stop-disrupting | inframe | frameshift)
AA_CHANGE : amino acid change at the mutated protein position. Because only the first amino acid affected by the genomic mutation is reported, for DIVs the wild-type and mutated amino acids may be the same even though the downstream amino acid sequence is changed.
PHYLOP : Phylop Score.
GERP++ : Gerp++ Score
SIPHY : Siphy Score
POLYPHEN_SCORE : Original Polyphen2-HumVar score for the mutation.
POLYPHEN_CLASS : Polyphen2 class (B = benign | D = probably/possibly damaging | NA)
SIFT_SCORE : Original Sift score for the mutation.
SIFT_CLASS : Sift class (B = benign | D = damaging | NA)
PseAA_SCORE : Score of our PseAAC-aware trained classifier (RF or LR)
PseAA_CLASS : Corresponding class of the variant (B = neutral | D = damaging | NA)
MESSAGE : error or warning message, if any.
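As an illustration of working with this output, the sketch below reads a result file and keeps the variants predicted as damaging. It assumes that the file starts with a header row whose column names match the field names listed above, which is not confirmed here; the function name is hypothetical.

```python
import csv


def damaging_variants(path):
    """Return the rows of a PaPI output file whose PAPI_CLASS is DAMAGING."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Tab-delimited text; a header row with the field names is assumed.
        reader = csv.DictReader(handle, delimiter="\t")
        return [row for row in reader if row.get("PAPI_CLASS") == "DAMAGING"]
```

Each returned row is a dictionary, so other fields such as `PAPI_SCORE` or `GENE` can be read by name (values are strings and need converting before numeric comparison).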