Q: Once upon a time, I added up some results from you from ClinVar how many variants with no known pathogenicity can we update this query?
The ClinVar dataset comes as gzipped xml from:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz
and we happen to have a fresh copy in dipper/raw/clinvarxml_alpha/
If we needed freah copy we would issue
In [1]:
cd ../../raw/clinvarxml_alpha
# wget --timestamping ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz
There are of course a number of ways to approach the question,
First what counts as variant?
"no known pathogenicity" is conceptually simple but the answer is supplied as freetext
First isolate the bits containing the answer.
In [7]:
# path shorteners
CVSET=/ReleaseSet/ClinVarSet
RCV=${CVSET}/ReferenceClinVarAssertion
SCV=${CVSET}/ClinVarAssertion
In [8]:
# do not try this at home
#xmlstarlet sel -t \
# -m ${SCV} -v ClinVarAccession/@Acc -o '|' -v ClinicalSignificance/Description -o '|' -n \
#ClinVarFullRelease_00-latest.xml.gz | sort > FULL_SCV_ClinicalSignificance.unl
#xmlstarlet sel -t \
# -m ${SCV} -v ClinVarAccession/@Acc -o "|" \
# -v ancestor::ClinVarSet/ReferenceClinVarAssertion/ClinVarAccession/@Acc -o "|" -n \
#ClinVarFullRelease_00-latest.xml.gz | sort > FULL_SCV_RCV.unl
#xmlstarlet sel -t
# -m ${RCV} -v ClinVarAccession/@Acc -o "|"
# -v 'MeasureSet/@ID' -o "|" -n \
#ClinVarFullRelease_00-latest.xml.gz | sort > FULL_RCV_VARID.unl
In [16]:
ls -l FULL_[RS]CV_*
see what is in the supplied freetext
In [18]:
cut -f2 -d\| FULL_SCV_ClinicalSignificance.unl | sort | uniq -c | sort -nr
We have a mapping of most of these in Clinvar translation table and should add the new ones
I am isolating them for this query.
In [24]:
cat patho_label.map
The last half dozen or so are SWAG,
maybe someone has better info on there classification.
Convert the freetext pathogenic call to its classification:
In [2]:
# head patho_label.map FULL_SCV_ClinicalSignificance.unl
tr -s '\t' < patho_label.map > patho_label.txt
tr '|' '\t' < FULL_SCV_ClinicalSignificance.unl > SCV_ClinicalSignificance.txt
head patho_label.txt SCV_ClinicalSignificance.txt
In [40]:
awk -F'\t' 'NR==FNR{p[$1]=$2}NR!=FNR{print $1,p[$2]}' \
patho_label.txt SCV_ClinicalSignificance.txt > SCV_class.txt
In [42]:
head SCV_class.txt
In [44]:
wc -l SCV_class.txt
grep -cv pathogenic SCV_class.txt
~75% of the records submitted to Clinvar (SCV) we would not classify as pathogenic.
In [47]:
tr '|' '\t' < FULL_SCV_RCV.unl > SCV_RCV.txt
In [26]:
join -j1 -o 2.2,1.2 1.3 SCV_class.txt SCV_RCV.txt | sort> RCV_class.txt
In [27]:
head RCV_class.txt
In [28]:
# How many RCV-call combinations
wc -l < RCV_class.txt
In [29]:
# How many distinct RCV-call combinations
sort -u RCV_class.txt | wc -l
In [46]:
echo "scale=3;(377430 / 406214* 100)" | bc
In [ ]:
About 7% of the calls Clinvar bundles together are perfectly redundant, or faithful reproductions.
In [2]:
cut -f1 -d ' ' RCV_class.txt | sort -u | wc -l
In [2]:
echo "scale=3;(377430 - 366567) | bc
In [3]:
echo "scale=3;(10863/377430 * 100)" | bc
This is saying that 11k or 2.8% of RCV have some variation in the pathogenic calls.
Note well this is any variation in our classification and not just signigificant disagreement.
In [49]:
grep -c pathogenic RCV_class.txt
In [81]:
grep pathogenic RCV_class.txt |cut -f1 | sort -u | wc -l
In [78]:
echo "scale=3;(94577 / 406214 * 100)" | bc
echo "scale=3;(82798 / 377430 * 100)" | bc
echo "scale=3;(82798 / 94577 * 100)" | bc
86.8% of all RCV have no pathogenic classification 88.1% of the unique RCV have no pathogenic classification
About 12.5% of the pathogenic call have some replication.
In [64]:
awk '$NF!="pathogenic"{p++}{c++}END{print p, c, p/c*100}' RCV_class.txt
(Without collapsing RCV this just recapiulates the SCV counts)
In [12]:
tr '|' '\t' < FULL_RCV_VARID.unl | sort > RCV_VARID.txt
In [69]:
join -j1 -o 1.2,2.2,2.3 RCV_VARID.txt RCV_class.txt |sort -n > VID_class.txt
In [71]:
head VID_class.txt
In [73]:
wc -l < VID_class.txt
In [77]:
cut -f1 -d ' ' VID_class.txt | sort -u | wc -l
In [79]:
grep -v pathogenic VID_class.txt | cut -f1 -d ' ' | sort -u | wc -l
In [80]:
echo "scale=3;(206300 / 268916 * 100)" | bc
In [82]:
sort -u VID_class.txt | awk '$NF=="pathogenic"{a[$1]=or(1,a[$1])}$NF=="benign"{a[$1]=or(2,a[$1])}END{for(v in a){if(a[v]>2)print v}}' | wc -l
Note these conflicts do not involve calls of uncertain significance
So far we have
( note the two 76.7% are via completly different pairs numbers more than 100k apart )
And here is where we are glad we did it this way.
I need to be counting "Variant of Uncertain Significance" (VUS) not the known pathogenic.
In [87]:
grep -vc significance SCV_class.txt
grep -v significance SCV_class.txt | cut -f1 |sort -u | wc -l
grep -v significance RCV_class.txt | cut -f1 |sort -u | wc -l
grep -v significance VID_class.txt | cut -f1 |sort -u | wc -l