Explore sample-level QC results

In this notebook we further explore the sample-level QC results generated by Sample-Level-QC.Rmd and specifically take a closer look at the outliers.

How to use this notebook:

RMarkdown was used to run a particular set of analyses and generate a canned report. If you have questions about the details of the analyses, see the canned reports for details. This notebook reads in those previously computed results, and re-renders 'small multiples' of the results.

With this notebook, specifically you can:

  • modify the plots further
  • add new plots
  • change the QC cutoff thresholds when examining outliers
  • use updated sample information
  • use more features from the sample information

It is currently configured to read in the QC results for the Simons Genome Diversity Project data. You can modify it to instead read in results for DeepVariant Platinum Genomes, 1000 Genomes, or results from your own cohort.

Setup


In [1]:
lapply(c('devtools', 'tidyverse', 'bigrquery', 'scales', 'skimr'),
       function(pkg_name) { if(! pkg_name %in% installed.packages()) { install.packages(pkg_name)} } )


  1. NULL
  2. NULL
  3. NULL
  4. NULL
  5. NULL

In [2]:
library(bigrquery)
library(scales)
library(skimr)
library(tidyverse)


Attaching package: ‘skimr’

The following object is masked from ‘package:stats’:

    filter

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
 ggplot2 3.2.1      purrr   0.3.2
 tibble  2.1.3      dplyr   0.8.3
 tidyr   0.8.3      stringr 1.4.0
 readr   1.3.1      forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 readr::col_factor() masks scales::col_factor()
 purrr::discard()    masks scales::discard()
 dplyr::filter()     masks skimr::filter(), stats::filter()
 dplyr::lag()        masks stats::lag()

In [3]:
#-----------[ CHANGE THIS] ------------------
BILLING_PROJECT_ID <- 'YOUR-PROJECT-ID'

In [4]:
AUTOSOMES <- c(paste('chr', c(as.character(seq(1, 22))), sep = ''),
               as.character(seq(1, 22)))
ALLOSOMES <- c('chrX', 'X', 'chrY', 'Y')
MITOCHONDRIA <- c('chrM', 'M', 'MT')

In [5]:
DEFAULT_PLOT_HEIGHT <- 7.5
options(repr.plot.width = 16, repr.plot.height = DEFAULT_PLOT_HEIGHT)
theme_set(theme_bw(base_size = 16))
update_geom_defaults('point', list(size = 2))

#' Returns a data frame with a y position and a label, for use annotating ggplot boxplots.
#'
#' @param d A data frame.
#' @return A data frame with column y as max and column label as length.
#'
get_boxplot_fun_data <- function(d) {
  return(data.frame(y = max(d), label = stringr::str_c('N = ', length(d))))
}

Retrieve sample information


In [6]:
raw_sample_info <- bq_table_download(bq_project_query(BILLING_PROJECT_ID, '
SELECT
  id_from_vcf,
  sex,
  region
FROM
  `bigquery-public-data.human_genome_variants.simons_genome_diversity_project_sample_attributes`
'))

print(skim(raw_sample_info))


Skim summary statistics
 n obs: 279 
 n variables: 3 

── Variable type:character ─────────────────────────────────────────────────────
    variable missing complete   n min max empty n_unique
 id_from_vcf       0      279 279   9  17     0      279
      region       0      279 279   6  18     0        7
         sex      31      248 279   1   6     0        3

Wrangle this into the columns we expect for the rest of the notebook.

Note that some cohorts have samples from several different sequencing platforms such as Illumina HiSeq2000 or Illumina HiSeqX or Complete Genomics, so many plots are faceted by sequencing platform.


In [7]:
sample_info <- raw_sample_info %>%
    mutate(sequencing_platform = 'Illumina') %>%
    select(name = id_from_vcf, sequencing_platform, sex, ancestry = region)

dim(sample_info)


  1. 279
  2. 4

Retrieve previously computed results


In [8]:
genome_results <- read_csv(pipe(
    'gsutil cat gs://genomics-public-data/simons-genome-diversity-project/reports/Simons_Genome_Diversity_Project_sample_results.csv'))
genome_results <- genome_results[,!grepl('X1', colnames(genome_results))]  # Drop row name column, if present.

print(skim(genome_results))


Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_double(),
  name = col_character(),
  no_calls = col_double(),
  all_calls = col_double(),
  missingness_rate = col_double(),
  private_variant_count = col_double(),
  heterozygous_variant_count = col_double(),
  perct_het_alt_in_snvs = col_double(),
  perct_hom_alt_in_snvs = col_double(),
  hom_AA_count = col_double(),
  het_RA_count = col_double()
)
Skim summary statistics
 n obs: 279 
 n variables: 10 

── Variable type:character ─────────────────────────────────────────────────────
 variable missing complete   n min max empty n_unique
     name       0      279 279   9  17     0      279

── Variable type:numeric ───────────────────────────────────────────────────────
                   variable missing complete   n       mean         sd
                  all_calls       0      279 279 4202064.38 347527.07 
               het_RA_count       2      277 279   15557     20910.59 
 heterozygous_variant_count       2      277 279 1538217.88 240758.45 
               hom_AA_count       2      277 279   44883.25  15131.25 
           missingness_rate       0      279 279       0.13      0.015
                   no_calls       0      279 279  566696.65  78680.61 
      perct_het_alt_in_snvs       2      277 279       0.21      0.27 
      perct_hom_alt_in_snvs       2      277 279       0.79      0.27 
      private_variant_count       2      277 279   51690.26  39931.7  
          p0         p25         p50        p75       p100     hist
 3733085       4e+06     4088139     4144993    5364212    ▁▇▂▁▁▁▁▁
      19         442         650       36249      74738    ▇▁▁▂▂▁▁▁
 1062073     1404741     1517632     1580983    2165891    ▁▂▃▇▁▁▁▂
     869       32544       49337       52084      83274    ▁▁▅▂▇▂▁▁
       0           0.13        0.13        0.14       0.18 ▁▁▁▁▁▇▅▁
       0      527438      553443      591394.5   820082    ▁▁▁▁▁▇▂▁
       0.006       0.009       0.012       0.54       0.69 ▇▁▁▁▁▁▂▂
       0.31        0.46        0.99        0.99       0.99 ▂▂▁▁▁▁▁▇
   15650       31980       38384       45963     223053    ▇▂▁▁▁▁▁▁

In [9]:
chrom_results <- read_csv(pipe(
    'gsutil cat gs://genomics-public-data/simons-genome-diversity-project/reports/Simons_Genome_Diversity_Project_sample_reference_results.csv'))
chrom_results <- chrom_results[,!grepl('X1', colnames(chrom_results))]  # Drop row name column, if present.

print(skim(chrom_results))


Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_double(),
  name = col_character(),
  reference_name = col_character(),
  number_of_calls = col_double(),
  HOM_ALT = col_double(),
  HAS_ALT = col_double(),
  N_SITES = col_double(),
  F = col_double(),
  transitions = col_double(),
  transversions = col_double(),
  titv = col_double()
)
Skim summary statistics
 n obs: 6814 
 n variables: 10 

── Variable type:character ─────────────────────────────────────────────────────
       variable missing complete    n min max empty n_unique
           name       0     6814 6814   9  17     0      277
 reference_name       0     6814 6814   1   2     0       25

── Variable type:numeric ───────────────────────────────────────────────────────
        variable missing complete    n      mean       sd   p0      p25
               F       0     6814 6814      0.46     0.16 0.3      0.37
         HAS_ALT       0     6814 6814 106852.15 61891.86 9    57743.25
         HOM_ALT       0     6814 6814  44321.13 25379.59 7    24817.75
         N_SITES       0     6814 6814 106852.15 61891.86 9    57743.25
 number_of_calls       0     6814 6814 106852.15 61891.86 9    57743.25
            titv      51     6763 6814      2.78     4.12 1.04     2.08
     transitions      51     6763 6814  73249.66 41619.31 8    41199.5 
   transversions      51     6763 6814  34408.05 19842.44 1    17584.5 
      p50       p75   p100     hist
     0.41      0.46      1 ▆▇▂▁▁▁▁▁
 1e+05    153598.25 286484 ▅▇▇▇▅▅▁▁
 45256.5   60788.75 130670 ▆▇▆▇▃▂▁▁
 1e+05    153598.25 286484 ▅▇▇▇▅▅▁▁
 1e+05    153598.25 286484 ▅▇▇▇▅▅▁▁
     2.14      2.21     62 ▇▁▁▁▁▁▁▁
 70037     1e+05    194438 ▅▇▇▇▅▅▁▁
 33603     49699     92046 ▅▇▇▇▅▆▁▁

In [10]:
joined_genome_results <- inner_join(genome_results, sample_info)

dim(joined_genome_results)


Joining, by = "name"
  1. 279
  2. 13

In [11]:
joined_chrom_results <- inner_join(chrom_results, sample_info)

dim(joined_chrom_results)


Joining, by = "name"
  1. 6814
  2. 13

Sample information


In [12]:
sample_info %>%
  group_by(ancestry, sex) %>%
  summarize(
    count = n(),
    proportion = n() / nrow(.)
  ) %>%
ggplot(aes(x = ancestry, y = count, fill = ancestry)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  geom_text(aes(label = str_glue('N = {count} ({percent(proportion)})'))) +
  facet_wrap( ~ sex, ncol = 1) +
  theme(axis.text.x = element_text(angle = 50, hjust = 1)) +
  ggtitle('Sample count and proportion by ancestry and sex')



In [13]:
sample_info %>%
  group_by(sequencing_platform, sex) %>%
  summarize(
    count = n(),
    proportion = n() / nrow(.)
  ) %>%
ggplot(aes(x = sequencing_platform, y = count, fill = sequencing_platform)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  geom_text(aes(label = str_glue('N = {count} ({percent(proportion)})'))) +
  coord_cartesian(clip = 'off') +
  facet_wrap(. ~ sex, ncol = 1) +
  theme(axis.text.x = element_text(angle = 50, hjust = 1)) +
  ggtitle('Sample count and proportion by sequencing platform and sex')


Variant call rate

Per-chromosome variant call rate


In [14]:
joined_chrom_results %>%
  filter(reference_name %in% c(AUTOSOMES, ALLOSOMES, MITOCHONDRIA)) %>%
  mutate(reference_name = parse_factor(reference_name, levels = c(AUTOSOMES, ALLOSOMES, MITOCHONDRIA))) %>%
ggplot(aes(y = number_of_calls, x = reference_name)) +
  geom_boxplot() +
  scale_y_continuous(labels = comma) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free_y') +
  ylab('Number of Variant Calls') +
  xlab('Chromosome') +
  ggtitle('Box plot: Count of variant calls per genome by chromosome') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))


Display per-chromosome outliers

Compute per chromosome mean and sd call count


In [15]:
chrom_calls_summary <- joined_chrom_results %>%
    group_by(reference_name, sequencing_platform) %>%
    summarize(
        number_of_calls_mean = mean(number_of_calls),
        number_of_calls_sd = sd(number_of_calls),
    )

head(chrom_calls_summary)


A grouped_df: 6 × 4
reference_namesequencing_platformnumber_of_calls_meannumber_of_calls_sd
<chr><chr><dbl><dbl>
1 Illumina200422.6016661.118
10Illumina130481.6010789.056
11Illumina129456.7910403.734
12Illumina121714.3113744.301
13Illumina103062.03 6769.014
14Illumina 84736.74 7430.972

In [16]:
low_chrom_calls_sd_multiplier <- 3

(chrom_calls_outliers <- joined_chrom_results %>%
    inner_join(chrom_calls_summary) %>%
    filter(number_of_calls < number_of_calls_mean - (low_chrom_calls_sd_multiplier * number_of_calls_sd)) %>%
    arrange(reference_name, number_of_calls))


Joining, by = c("reference_name", "sequencing_platform")
A tibble: 9 × 15
namereference_namenumber_of_callsHOM_ALTHAS_ALTN_SITESFtransitionstransversionstitvsequencing_platformsexancestrynumber_of_calls_meannumber_of_calls_sd
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005441-DNA_B0312 3369 1837 3369 33690.54527 2297 10722.142724IlluminafemaleOceania 121714.3113744.30
LP6005443-DNA_G0112 6440 3386 6440 64400.52578 4511 19292.338517IlluminafemaleEastAsia 121714.3113744.30
SS6004471 9 736192603973619736190.3537048294253251.906969Illuminamale Africa 109534.7510566.19
LP6005441-DNA_G09X 1826 1743 1826 18260.95455 1232 5942.074074IlluminaNA WestEurasia 61613.3816510.37
LP6005443-DNA_C01X 3101 1613 3101 31010.52015 2125 9762.177254Illuminamale EastAsia 61613.3816510.37
LP6005592-DNA_A02X 3393 1852 3393 33930.54583 2310 10832.132964Illuminamale WestEurasia 61613.3816510.37
LP6005442-DNA_C10X 5141 3438 5141 51410.66874 3477 16642.089543Illuminamale WestEurasia 61613.3816510.37
LP6005443-DNA_F04X 5431 4149 5431 54310.76395 3700 17312.137493Illuminamale CentralAsiaSiberia 61613.3816510.37
LP6005443-DNA_E06X 5450 3471 5450 54500.63688 3626 18241.987939Illuminamale Africa 61613.3816510.37

In [17]:
problems <- chrom_calls_outliers %>%
                 group_by(name) %>%
                 summarize(
                     fail = str_glue('per chrom variant count is at least {low_chrom_calls_sd_multiplier}',
                                     ' standard deviations below the platform mean for ',
                                     str_c(reference_name, collapse = ',')))

dim(problems)
head(problems)


  1. 9
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005441-DNA_B03per chrom variant count is at least 3 standard deviations below the platform mean for 12
LP6005441-DNA_G09per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005442-DNA_C10per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_C01per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_E06per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_F04per chrom variant count is at least 3 standard deviations below the platform mean for X

Per-genome variant call rate


In [18]:
genome_call_rate_summary <- joined_chrom_results %>%
  group_by(name, sex, ancestry) %>%
  summarize(n = sum(number_of_calls))

p <- genome_call_rate_summary %>%
  ggplot() +
  geom_point(aes(x = name, y = n, color = ancestry, shape = sex)) +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_continuous(labels = comma) +
  xlab('Sample') +
  ylab('Number of Calls') +
  ggtitle('Scatter Plot: Count of Calls Per Genome') +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
if (nrow(genome_call_rate_summary) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Warning message:
“Removed 31 rows containing missing values (geom_point).”

In [19]:
genome_call_rate_summary <- joined_chrom_results %>%
  group_by(name, sex, sequencing_platform) %>%
  summarize(n = sum(number_of_calls))

p <- genome_call_rate_summary %>%
  ggplot() +
  geom_point(aes(x = name, y = n, color = sequencing_platform, shape = sex)) +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_continuous(labels = comma) +
  xlab('Sample') +
  ylab('Number of Calls') +
  ggtitle('Scatter Plot: Count of Calls Per Genome') +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
if (nrow(genome_call_rate_summary) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Warning message:
“Removed 31 rows containing missing values (geom_point).”

Display per-genome outliers


In [20]:
genome_calls_summary <- genome_call_rate_summary %>%
    group_by(sequencing_platform) %>%
    summarize(
        genome_calls_mean = mean(n),
        genome_calls_sd = sd(n),
    )

head(genome_calls_summary)


A tibble: 1 × 3
sequencing_platformgenome_calls_meangenome_calls_sd
<chr><dbl><dbl>
Illumina2628486237032.8

In [21]:
low_genome_calls_sd_multiplier <- 3

(genome_calls_outliers <- genome_call_rate_summary %>%
    inner_join(genome_calls_summary) %>%
    filter(n < genome_calls_mean - (low_genome_calls_sd_multiplier * genome_calls_sd)) %>%
    arrange(n))


Joining, by = "sequencing_platform"
A grouped_df: 0 × 6
namesexsequencing_platformngenome_calls_meangenome_calls_sd
<chr><chr><chr><dbl><dbl><dbl>

In [22]:
problems <- rbind(
    tibble(name = unique(genome_calls_outliers$name),
           fail = str_glue('per genome variant count is at least {low_genome_calls_sd_multiplier}',
                           ' standard deviations below the platform mean')),
    problems)

dim(problems)
head(problems)


  1. 9
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005441-DNA_B03per chrom variant count is at least 3 standard deviations below the platform mean for 12
LP6005441-DNA_G09per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005442-DNA_C10per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_C01per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_E06per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005443-DNA_F04per chrom variant count is at least 3 standard deviations below the platform mean for X

Missingness rate


In [23]:
joined_genome_results %>%
ggplot(aes(y = missingness_rate, x = ancestry, fill = ancestry)) +
  geom_boxplot() +
  stat_summary(fun.data = get_boxplot_fun_data, geom = 'text', size = 3,
               position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_y_continuous(labels = percent_format()) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free_y') +
  ylab('Missingness Rate') +
  xlab('Ancestry') +
  ggtitle('Genome-Specific Missingness') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))



In [24]:
p <- joined_genome_results %>%
  ggplot(aes(x = name, y = missingness_rate, color = sex)) +
  geom_point() +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_continuous(limits = c(0, NA), labels = percent_format()) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free') +
  xlab('Sample') +
  ylab('Missingness Rate') +
  ggtitle('Scatter Plot: Genome-Specific Missingness')
if (nrow(joined_genome_results) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Display outliers


In [25]:
(missingness_summary <- joined_genome_results %>%
    group_by(sequencing_platform) %>%
    summarize(
        missingness_rate_mean = mean(missingness_rate),
        missingness_rate_sd = sd(missingness_rate),
    ))


A tibble: 1 × 3
sequencing_platformmissingness_rate_meanmissingness_rate_sd
<chr><dbl><dbl>
Illumina0.134840.01535446

In [26]:
low_missingness_sd_multiplier <- 3

(low_missingness_outliers <- joined_genome_results %>%
     inner_join(missingness_summary) %>%
     filter(
         missingness_rate < missingness_rate_mean - (low_missingness_sd_multiplier * missingness_rate_sd)
     ) %>%
     arrange(missingness_rate))


Joining, by = "sequencing_platform"
A tibble: 2 × 15
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestrymissingness_rate_meanmissingness_rate_sd
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005443-DNA_F11039334690NANANANANANAIlluminafemaleAmerica 0.134840.01535446
LP6005443-DNA_H03040203240NANANANANANAIlluminafemaleCentralAsiaSiberia0.134840.01535446

In [27]:
high_missingness_sd_multiplier <- 3

(high_missingness_outliers <- joined_genome_results %>%
     inner_join(missingness_summary) %>%
     filter(missingness_rate > missingness_rate_mean + (high_missingness_sd_multiplier * missingness_rate_sd)) %>%
     arrange(desc(missingness_rate)))


Joining, by = "sequencing_platform"
A tibble: 1 × 15
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestrymissingness_rate_meanmissingness_rate_sd
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005442-DNA_F0175111541137170.18258792679813460920.5990.4012488937170IlluminafemaleCentralAsiaSiberia0.134840.01535446

In [28]:
problems <- rbind(
    tibble(name = unique(high_missingness_outliers$name),
           fail = str_glue('per genome missingness rate is at least {high_missingness_sd_multiplier}',
                           ' standard deviations above the platform mean')),
    tibble(name = unique(low_missingness_outliers$name),
           fail = str_glue('per genome missingness rate is at least {low_missingness_sd_multiplier}',
                           ' standard deviations below the platform mean')),
    problems)

dim(problems)
head(problems)


  1. 12
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005442-DNA_F01per genome missingness rate is at least 3 standard deviations above the platform mean
LP6005443-DNA_F11per genome missingness rate is at least 3 standard deviations below the platform mean
LP6005443-DNA_H03per genome missingness rate is at least 3 standard deviations below the platform mean
LP6005441-DNA_B03per chrom variant count is at least 3 standard deviations below the platform mean for 12
LP6005441-DNA_G09per chrom variant count is at least 3 standard deviations below the platform mean for X
LP6005442-DNA_C10per chrom variant count is at least 3 standard deviations below the platform mean for X

Singleton rate


In [29]:
joined_genome_results %>%
  ggplot(aes(y = private_variant_count, x = ancestry, fill = ancestry)) +
  geom_boxplot() +
  stat_summary(fun.data = get_boxplot_fun_data, geom = 'text',
               position = position_dodge(width = 0.9), vjust = -0.5) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free_y') +
  scale_y_continuous(labels = comma, expand = c(0.3, 0)) +
  ylab('Number of Singletons') +
  xlab('Ancestry') +
  ggtitle('Box plot: Count of singletons per genome by ancestry') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))


Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”Warning message:
“Removed 2 rows containing non-finite values (stat_summary).”

In [30]:
p <- joined_genome_results %>%
  ggplot(aes(x = name, y = private_variant_count, color = sex)) +
  geom_point() +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_log10(labels = comma) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free') +
  xlab('Sample') +
  ylab('Number of Singletons (log scale)') +
  ggtitle('Scatter Plot: Count of Singletons Per Genome')
if (nrow(joined_genome_results) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Warning message:
“Removed 2 rows containing missing values (geom_point).”

Display outliers


In [31]:
(singleton_summary <- joined_genome_results %>%
     group_by(sequencing_platform, ancestry) %>%
     summarize(
         singleton_count_mean = mean(private_variant_count),
         singleton_count_sd = sd(private_variant_count),
     ))


A grouped_df: 7 × 4
sequencing_platformancestrysingleton_count_meansingleton_count_sd
<chr><chr><dbl><dbl>
IlluminaAfrica 129941.9147340.445
IlluminaAmerica NA NA
IlluminaCentralAsiaSiberia NA NA
IlluminaEastAsia 39639.74 4331.256
IlluminaOceania 47300.6818688.494
IlluminaSouthAsia 44108.92 5729.143
IlluminaWestEurasia 34190.81 4668.445

In [32]:
high_singleton_sd_multiplier <- 3

(high_singleton_outliers <- joined_genome_results %>%
     inner_join(singleton_summary) %>%
     filter(private_variant_count > singleton_count_mean + (high_singleton_sd_multiplier * singleton_count_sd)) %>%
     arrange(desc(private_variant_count)))


Joining, by = c("sequencing_platform", "ancestry")
A tibble: 1 × 15
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestrysingleton_count_meansingleton_count_sd
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
SS600447855039541097290.133924910349013725110.510.493521936586IlluminafemaleOceania47300.6818688.49

In [33]:
problems <- rbind(
    tibble(name = high_singleton_outliers$name,
           fail = str_glue('singleton count at least {high_singleton_sd_multiplier}',
           ' standard deviations above the ancestry and platform mean')),
    problems)

dim(problems)
head(problems)


  1. 13
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
SS6004478 singleton count at least 3 standard deviations above the ancestry and platform mean
LP6005442-DNA_F01per genome missingness rate is at least 3 standard deviations above the platform mean
LP6005443-DNA_F11per genome missingness rate is at least 3 standard deviations below the platform mean
LP6005443-DNA_H03per genome missingness rate is at least 3 standard deviations below the platform mean
LP6005441-DNA_B03per chrom variant count is at least 3 standard deviations below the platform mean for 12
LP6005441-DNA_G09per chrom variant count is at least 3 standard deviations below the platform mean for X

Heterozygosity rate


In [34]:
ggplot(joined_genome_results,
       aes(y = heterozygous_variant_count, x = ancestry, fill = ancestry)) +
  geom_boxplot() +
  stat_summary(fun.data = get_boxplot_fun_data, geom = 'text',
               position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_y_continuous(labels = comma, expand = c(0.3, 0)) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free_y') +
  ylab('Number of Heterozyous Variants') +
  xlab('Ancestry') +
  ggtitle('Box plot: Count of heterozygous variants per genome by ancestry') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))


Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”Warning message:
“Removed 2 rows containing non-finite values (stat_summary).”

In [35]:
p <- ggplot(joined_genome_results) +
  geom_point(aes(x = name, y = heterozygous_variant_count, color = sex)) +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_continuous(labels = comma) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free') +
  xlab('Sample') +
  ylab('Number of Heterozygous Variants') +
  ggtitle('Scatter Plot: Count of Heterozygous Variants Per Genome')
if (nrow(joined_genome_results) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Warning message:
“Removed 2 rows containing missing values (geom_point).”

Display outliers


In [36]:
(heterozygous_variant_count_summary <- joined_genome_results %>%
    group_by(sequencing_platform, ancestry) %>%
    summarize(
        heterozygous_variant_count_mean = mean(heterozygous_variant_count),
        heterozygous_variant_count_sd = sd(heterozygous_variant_count),
    ))


A grouped_df: 7 × 4
sequencing_platformancestryheterozygous_variant_count_meanheterozygous_variant_count_sd
<chr><chr><dbl><dbl>
IlluminaAfrica 2004011124492.59
IlluminaAmerica NA NA
IlluminaCentralAsiaSiberia NA NA
IlluminaEastAsia 1428578 58931.13
IlluminaOceania 1270962 98640.50
IlluminaSouthAsia 1538952 83020.18
IlluminaWestEurasia 1533601 52577.87

In [37]:
low_het_count_sd_multiplier <- 3

(low_heterozygous_variant_count_outliers <- joined_genome_results %>%
     inner_join(heterozygous_variant_count_summary) %>%
     filter(heterozygous_variant_count
            < heterozygous_variant_count_mean - (low_het_count_sd_multiplier * heterozygous_variant_count_sd)
     ) %>%
     arrange(heterozygous_variant_count))


Joining, by = c("sequencing_platform", "ancestry")
A tibble: 1 × 15
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestryheterozygous_variant_count_meanheterozygous_variant_count_sd
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005443-DNA_D0951013438735890.13169543951411353030.0090.99153132472IlluminamaleSouthAsia153895283020.18

In [38]:
high_het_count_sd_multiplier <- 3

(high_heterozygous_variant_count_outliers <- joined_genome_results %>%
     inner_join(heterozygous_variant_count_summary) %>%
     filter(heterozygous_variant_count
            > heterozygous_variant_count_mean + (high_het_count_sd_multiplier * heterozygous_variant_count_sd)
     ) %>%
     arrange(desc(heterozygous_variant_count)))


Joining, by = c("sequencing_platform", "ancestry")
A tibble: 0 × 15
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestryheterozygous_variant_count_meanheterozygous_variant_count_sd
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>

In [39]:
problems <- rbind(
    tibble(name = low_heterozygous_variant_count_outliers$name,
           fail = str_glue('heterozygous variant count at least {low_het_count_sd_multiplier}',
                           ' standard deviations below the ancestry and platform mean')),
    tibble(name = low_heterozygous_variant_count_outliers$name,
           fail = str_glue('heterozygous variant count at least {high_het_count_sd_multiplier}',
                           ' standard deviations above the ancestry and platform mean')),
    problems)

dim(problems)
head(problems)


  1. 15
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005443-DNA_D09heterozygous variant count at least 3 standard deviations below the ancestry and platform mean
LP6005443-DNA_D09heterozygous variant count at least 3 standard deviations above the ancestry and platform mean
SS6004478 singleton count at least 3 standard deviations above the ancestry and platform mean
LP6005442-DNA_F01per genome missingness rate is at least 3 standard deviations above the platform mean
LP6005443-DNA_F11per genome missingness rate is at least 3 standard deviations below the platform mean
LP6005443-DNA_H03per genome missingness rate is at least 3 standard deviations below the platform mean

Homozygosity rate


In [40]:
joined_chrom_results %>%
  filter(reference_name %in% c(AUTOSOMES, ALLOSOMES, MITOCHONDRIA)) %>%
  mutate(reference_name = parse_factor(reference_name, levels = c(AUTOSOMES, ALLOSOMES, MITOCHONDRIA))) %>%
  ggplot(aes(y = F, x = reference_name, color = sex)) +
  geom_boxplot() +
  facet_grid(sex ~ .) +
  ylab('Fraction of Homozygous Variants') +
  xlab('Reference Name') +
  ggtitle('Fraction of Homozygous Variants Per Genome') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))



In [41]:
head(fraction_homozygous_summary <- joined_chrom_results %>%
    group_by(reference_name) %>%
    summarize(
        F_mean = mean(F),
        F_sd = sd(F),
    ))


A tibble: 6 × 3
reference_nameF_meanF_sd
<chr><dbl><dbl>
1 0.41884190.05076990
100.41119180.04964400
110.42215810.05315690
120.41583900.05427126
130.44042290.05769077
140.41161910.05125374

Display per-chromosome outliers


In [42]:
low_fraction_homozygous_sd_multiplier <- 4

(low_hom_outliers <- joined_chrom_results %>%
    inner_join(fraction_homozygous_summary) %>%
    filter(F < F_mean - (low_fraction_homozygous_sd_multiplier * F_sd)) %>%
    arrange(sex, reference_name, F))


Joining, by = "reference_name"
A tibble: 3 × 15
namereference_namenumber_of_callsHOM_ALTHAS_ALTN_SITESFtransitionstransversionstitvsequencing_platformsexancestryF_meanF_sd
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005443-DNA_B06MT 10 7 10 100.70000 8 24.000000IlluminafemaleWestEurasia0.92222950.05258101
LP6005443-DNA_A06MT 17 12 17 170.70588 13 43.250000Illuminamale WestEurasia0.92222950.05258101
LP6005592-DNA_G03Y 9755609759750.574365624131.360775IlluminaNA WestEurasia0.94623400.05181015

In [43]:
high_fraction_homozygous_sd_multiplier <- 4

(high_hom_outliers <- joined_chrom_results %>%
    inner_join(fraction_homozygous_summary) %>%
    filter(F > F_mean + (high_fraction_homozygous_sd_multiplier * F_sd)) %>%
    arrange(sex, reference_name, desc(F)))


Joining, by = "reference_name"
A tibble: 18 × 15
namereference_namenumber_of_callsHOM_ALTHAS_ALTN_SITESFtransitionstransversionstitvsequencing_platformsexancestryF_meanF_sd
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005441-DNA_B1210108474740931084741084740.6830574360341142.179750IlluminafemaleAmerica 0.41119180.04964400
LP6005441-DNA_A0412101085662931010851010850.6558169610314752.211597IlluminafemaleAmerica 0.41583900.05427126
LP6005441-DNA_A1213 8535463926 85354 853540.7489557955273992.115223IlluminafemaleAmerica 0.44042290.05769077
LP6005441-DNA_A0414 7096244091 70962 709620.6213348516224462.161454IlluminafemaleAmerica 0.41161910.05125374
LP6005441-DNA_F1215 5892841473 58928 589280.7037939970189582.108345IlluminafemaleEastAsia 0.41707240.05481585
LP6005441-DNA_A1219 3459624856 34596 345960.7184624207103892.330061IlluminafemaleAmerica 0.37473670.05114097
LP6005441-DNA_F1019 3429420158 34294 342940.5878023890104042.296232IlluminafemaleAmerica 0.37473670.05114097
LP6005443-DNA_H0221 2896923209 28969 289690.8011719708 92612.128064IlluminafemaleCentralAsiaSiberia0.40759640.04901320
LP6005441-DNA_A125 140029815621400291400290.5824794244457852.058403IlluminafemaleAmerica 0.38947730.04727087
LP6005441-DNA_H068 109655806531096551096550.7355272283373721.934149IlluminafemaleAmerica 0.40130460.05570589
LP6005441-DNA_D0115 6211343072 62113 621130.6934542291198222.133538Illuminamale SouthAsia 0.41707240.05481585
LP6005443-DNA_D0317 4490327478 44903 449030.6119431679132242.395569Illuminamale CentralAsiaSiberia0.39793380.04901049
LP6005441-DNA_G0618 6760142981 67601 676010.6358046322212792.176888Illuminamale America 0.42045470.05128500
LP6005441-DNA_A0519 3661023039 36610 366100.6293125803108072.387619Illuminamale WestEurasia 0.37473670.05114097
LP6005441-DNA_G0622 2338918427 23389 233890.7878516517 68722.403522Illuminamale America 0.38525610.06306327
LP6005442-DNA_F046 143386898911433861433860.6269298317450692.181477Illuminamale WestEurasia 0.39986430.05448162
LP6005519-DNA_B0514 7076844079 70768 707680.6228748466223022.173168IlluminaNA SouthAsia 0.41161910.05125374
LP6005519-DNA_C0320 4529026876 45290 452900.5934231559137312.298376IlluminaNA WestEurasia 0.38909940.05075676

In [44]:
problems <- rbind(
    low_hom_outliers %>%
    group_by(name) %>%
    summarize(fail = str_glue('per chrom homozygosity is at least {low_fraction_homozygous_sd_multiplier}',
                               ' standard deviations below the mean ',
                              str_c(reference_name, collapse = ','))),
    high_hom_outliers %>%
    group_by(name) %>%
    summarize(fail = str_glue('per chrom homozygosity is at least {high_fraction_homozygous_sd_multiplier}',
                               ' standard deviations above the mean ',
                              str_c(reference_name, collapse = ','))),
    problems)

dim(problems)
head(problems)


  1. 32
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005443-DNA_A06per chrom homozygosity is at least 4 standard deviations below the mean MT
LP6005443-DNA_B06per chrom homozygosity is at least 4 standard deviations below the mean MT
LP6005592-DNA_G03per chrom homozygosity is at least 4 standard deviations below the mean Y
LP6005441-DNA_A04per chrom homozygosity is at least 4 standard deviations above the mean 12,14
LP6005441-DNA_A05per chrom homozygosity is at least 4 standard deviations above the mean 19
LP6005441-DNA_A12per chrom homozygosity is at least 4 standard deviations above the mean 13,19,5

Ti/Tv ratio per chromosome


In [45]:
joined_chrom_results %>%
  filter(reference_name %in% c(AUTOSOMES)) %>%
  mutate(reference_name = parse_factor(reference_name, levels = c(AUTOSOMES))) %>%
  ggplot(aes(y = titv, x = reference_name, color = sex)) +
  geom_boxplot() +
  ylab('Ti/Tv ratio') +
  xlab('Chromosome') +
  ggtitle('Ti/Tv ratio per genome') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))


Display per-chromosome outliers


In [46]:
titv_summary <- joined_chrom_results %>%
    group_by(reference_name) %>%
    summarize(
        titv_mean = mean(titv),
        titv_sd = sd(titv),
    )

head(titv_summary)


A tibble: 6 × 3
reference_nametitv_meantitv_sd
<chr><dbl><dbl>
1 2.2192520.01258482
102.2028670.01241526
112.1117310.01504897
122.1965490.01564716
132.1379400.01309361
142.1570420.01730380

In [47]:
low_titv_sd_multiplier <- 4

(low_titv_outliers <- joined_chrom_results %>%
    inner_join(titv_summary) %>%
    filter(titv < titv_mean - (low_titv_sd_multiplier * titv_sd)) %>%
    arrange(reference_name, titv))


Joining, by = "reference_name"
A tibble: 7 × 15
namereference_namenumber_of_callsHOM_ALTHAS_ALTN_SITESFtransitionstransversionstitvsequencing_platformsexancestrytitv_meantitv_sd
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005592-DNA_H032206100886762061002061000.43026138892672082.066599IlluminamaleOceania 2.1152570.011184957
LP6005592-DNA_H034186016834101860161860160.44840124896611202.043455IlluminamaleOceania 2.0837880.009898056
SS6004471 9 7361926039 73619 736190.35370 48294253251.906969IlluminamaleAfrica 2.0142150.016098735
LP6005441-DNA_G10X 5193449354 51934 519340.95032 33785181491.861535IlluminamaleWestEurasia2.0544050.035042037
LP6005441-DNA_A10X 5509552322 55095 550950.94967 35945191501.877023IlluminamaleOceania 2.0544050.035042037
LP6005442-DNA_D01X 5015648158 50156 501560.96016 32779173771.886344IlluminamaleEastAsia 2.0544050.035042037
LP6005592-DNA_H03X 5198949687 51989 519890.95572 34119178701.909289IlluminamaleOceania 2.0544050.035042037

In [48]:
high_titv_sd_multiplier <- 4

(high_titv_outliers <- joined_chrom_results %>%
    inner_join(titv_summary) %>%
    filter(titv > titv_mean + (high_titv_sd_multiplier * titv_sd)) %>%
    arrange(reference_name, desc(titv)))


Joining, by = "reference_name"
A tibble: 2 × 15
namereference_namenumber_of_callsHOM_ALTHAS_ALTN_SITESFtransitionstransversionstitvsequencing_platformsexancestrytitv_meantitv_sd
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr><dbl><dbl>
LP6005443-DNA_G0112 6440 3386 6440 64400.52578 4511 19292.338517IlluminafemaleEastAsia 2.1965490.015647164
LP6005442-DNA_E044 181307691661813071813070.38149123433578742.132788Illuminamale WestEurasia2.0837880.009898056

In [49]:
problems <- rbind(
    low_titv_outliers %>%
    group_by(name) %>%
    summarize(fail = str_glue('per chrom ti/tv is at least {low_titv_sd_multiplier} standard deviations below the mean for ',
                              str_c(reference_name, collapse = ','))),
    high_titv_outliers %>%
    group_by(name) %>%
    summarize(fail = str_glue('per chrom ti/tv is at least {high_titv_sd_multiplier} standard deviations above the mean for ',
                              str_c(reference_name, collapse = ','))),
    problems)

dim(problems)
head(problems)


  1. 39
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005441-DNA_A10per chrom ti/tv is at least 4 standard deviations below the mean for X
LP6005441-DNA_G10per chrom ti/tv is at least 4 standard deviations below the mean for X
LP6005442-DNA_D01per chrom ti/tv is at least 4 standard deviations below the mean for X
LP6005592-DNA_H03per chrom ti/tv is at least 4 standard deviations below the mean for 2,4,X
SS6004471 per chrom ti/tv is at least 4 standard deviations below the mean for 9
LP6005442-DNA_E04per chrom ti/tv is at least 4 standard deviations above the mean for 4

Sex inference


In [50]:
ggplot(joined_genome_results, aes(x = sex, y = perct_het_alt_in_snvs, fill = sex)) +
  geom_boxplot() +
  stat_summary(fun.data = get_boxplot_fun_data, geom = 'text', position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_y_continuous(labels = percent_format()) +
  xlab('Sex') +
  ylab('Heterozygosity Rate ') +
  ggtitle('Box Plot: Heterozygosity Rate on the X Chromosome')


Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”Warning message:
“Removed 2 rows containing non-finite values (stat_summary).”

In [51]:
p <- ggplot(joined_genome_results) +
  geom_point(aes(x = name, y = perct_het_alt_in_snvs, color = sex)) +
  scale_x_discrete(expand = c(0.05, 1)) +
  scale_y_continuous(labels = percent_format()) +
  facet_wrap(. ~ sequencing_platform, ncol = 1, scales = 'free') +
  xlab('Sample') +
  ylab('Heterozygosity Rate ') +
  ggtitle('Scatter Plot: Heterozygosity Rate on the X Chromosome')
if (nrow(joined_genome_results) <= 20) {
  p + theme(axis.text.x = element_text(angle = 50, hjust = 1))
} else {
  p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
            panel.grid.major.x = element_blank())
}


Warning message:
“Removed 2 rows containing missing values (geom_point).”

Display outliers


In [52]:
(m_outliers <- joined_genome_results %>%
    filter(!sex %in% c('F', 'FEMALE', 'female') & perct_het_alt_in_snvs > 0.4) %>%
    arrange(desc(perct_het_alt_in_snvs))
)


A tibble: 8 × 13
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestry
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr>
LP6005592-DNA_B0150898040755980.1248847 2972515861280.6120.3882727543093IlluminaU WestEurasia
LP6005592-DNA_C0572361753642120.134897222213621246790.6070.3934838574738IlluminaNAAfrica
LP6005677-DNA_B0155062740862080.1347526 3322615443490.5860.4142895640960IlluminaNAWestEurasia
LP6005592-DNA_C0156863040523430.1403213 2572314969800.5500.4503024036956IlluminaU WestEurasia
LP6005519-DNA_E0657251740499020.1413656 3793213830800.5080.4923313434166IlluminaNAOceania
LP6005519-DNA_G0256616739494990.1433516 2580012674780.4660.5343648931894IlluminaNAAmerica
LP6005677-DNA_D0369048452380230.131821522305319520650.4640.5365926151298IlluminaNAAfrica
SS6004479 54236838871790.1395274 2637412435560.4570.5433505129543IlluminaNAAmerica

In [53]:
(f_outliers <- joined_genome_results %>%
    filter(! sex %in% c('M', 'MALE', 'male') & perct_het_alt_in_snvs < 0.4) %>%
    arrange(perct_het_alt_in_snvs))


A tibble: 29 × 13
nameno_callsall_callsmissingness_rateprivate_variant_countheterozygous_variant_countperct_het_alt_in_snvsperct_hom_alt_in_snvshom_AA_counthet_RA_countsequencing_platformsexancestry
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr><chr>
LP6005519-DNA_B0456692541429760.1368400 4931115472850.0060.99453098 330IlluminaNA SouthAsia
LP6005519-DNA_C0352757740200380.1312368 3317814162750.0060.99450180 319IlluminaNA WestEurasia
LP6005519-DNA_C0456034941211900.1359678 4671015149470.0060.99452994 312IlluminaNA SouthAsia
LP6005519-DNA_D0356151440842450.1374829 3358015311950.0060.99449785 303IlluminaNA WestEurasia
LP6005519-DNA_E0556887541370790.1375064 5792815080170.0060.99452651 337IlluminaNA SouthAsia
LP6005519-DNA_A0457519341561240.1383965 4784215573110.0070.99354123 376IlluminaNA SouthAsia
LP6005519-DNA_B0554831840964400.1338523 4528214689580.0070.99351976 350IlluminaNA SouthAsia
LP6005519-DNA_C0557067941445290.1376945 4976215423170.0070.99351461 348IlluminaNA SouthAsia
LP6005519-DNA_D0457967541487040.1397244 4718715381680.0070.99350708 348IlluminaNA SouthAsia
LP6005519-DNA_D0559339641751340.1421262 4853515187110.0070.99352873 353IlluminaNA SouthAsia
LP6005519-DNA_E0457965141636960.1392155 5113315440050.0070.99351263 354IlluminaNA SouthAsia
LP6005519-DNA_F0351954640985030.1267648 3573015667030.0070.99350415 367IlluminaNA WestEurasia
LP6005519-DNA_F0457436341697540.1377451 5088315477030.0070.99351346 347IlluminaNA SouthAsia
LP6005519-DNA_G0355213941744300.1322669 4353315961440.0070.99349942 366IlluminaNA SouthAsia
LP6005519-DNA_G0455344341495550.1333741 5006715535320.0070.99352347 365IlluminaNA SouthAsia
LP6005519-DNA_H0459048141560260.1420783 5109715438990.0070.99351791 361IlluminaNA SouthAsia
LP6005619-DNA_B0159230842686570.1387575 4794916398050.0070.99352749 366IlluminaNA Africa
LP6005619-DNA_C0162895943063180.1460549 4841016673440.0070.99352323 369IlluminaNA Africa
LP6005677-DNA_G0157616949307280.116852711387320704480.0070.99371097 500IlluminaNA Africa
LP6005519-DNA_A0552357541621250.1257951 5218315809830.0080.99253865 420IlluminaNA SouthAsia
LP6005519-DNA_H0355224541440000.1332638 4472915800600.0080.99251356 397IlluminaNA SouthAsia
LP6005592-DNA_D0449747039818470.1249345 3778213773580.0080.99249267 398IlluminaNA WestEurasia
LP6005592-DNA_D0153072040184250.1320716 2567714510090.0090.99147947 424IlluminaU WestEurasia
LP6005677-DNA_D0149159439580970.1241996 2901813148780.0120.98850880 599IlluminaNA America
LP6005441-DNA_G0951660041688450.1239192 4341316605410.0230.977 1664 39IlluminaNA WestEurasia
LP6005592-DNA_G0358942140656150.1449771 4413513911600.0390.96148043 1969IlluminaNA WestEurasia
LP6005441-DNA_B1247379537694800.1256924 1934610620730.2770.7234164315934IlluminafemaleAmerica
LP6005441-DNA_A1249160837967960.1294797 2391410928080.3070.6934083018073IlluminafemaleAmerica
LP6005441-DNA_F1057395138738780.1481593 2742911660500.3590.6413708620798IlluminafemaleAmerica

In [54]:
problems <- rbind(
    tibble(name = m_outliers$name, fail = str_glue('sample info indicates {m_outliers$sex}, genome indicates female')),
    tibble(name = f_outliers$name, fail = str_glue('sample info indicates {f_outliers$sex}, genome indicates male')),
    problems)

dim(problems)
head(problems)


  1. 76
  2. 2
A tibble: 6 × 2
namefail
<chr><glue>
LP6005592-DNA_B01sample info indicates U, genome indicates female
LP6005592-DNA_C05sample info indicates NA, genome indicates female
LP6005677-DNA_B01sample info indicates NA, genome indicates female
LP6005592-DNA_C01sample info indicates U, genome indicates female
LP6005519-DNA_E06sample info indicates NA, genome indicates female
LP6005519-DNA_G02sample info indicates NA, genome indicates female

Problem summary

Group all the 'problems' together by sample and emit a CSV report.


In [55]:
(problem_summary <- problems %>%
    group_by(name) %>%
    summarize(
        issues = str_c(fail, collapse = ';\n '),
        issue_count = n()
    ) %>%
    arrange(desc(issue_count), name))


A tibble: 66 × 3
nameissuesissue_count
<chr><chr><int>
LP6005441-DNA_A12sample info indicates female, genome indicates male; per chrom homozygosity is at least 4 standard deviations above the mean 13,19,5 2
LP6005441-DNA_B12sample info indicates female, genome indicates male; per chrom homozygosity is at least 4 standard deviations above the mean 10 2
LP6005441-DNA_F10sample info indicates female, genome indicates male; per chrom homozygosity is at least 4 standard deviations above the mean 19 2
LP6005441-DNA_G09sample info indicates NA, genome indicates male; per chrom variant count is at least 3 standard deviations below the platform mean for X 2
LP6005443-DNA_D09heterozygous variant count at least 3 standard deviations below the ancestry and platform mean; heterozygous variant count at least 3 standard deviations above the ancestry and platform mean2
LP6005443-DNA_G01per chrom ti/tv is at least 4 standard deviations above the mean for 12; per chrom variant count is at least 3 standard deviations below the platform mean for 12 2
LP6005519-DNA_B05sample info indicates NA, genome indicates male; per chrom homozygosity is at least 4 standard deviations above the mean 14 2
LP6005519-DNA_C03sample info indicates NA, genome indicates male; per chrom homozygosity is at least 4 standard deviations above the mean 20 2
LP6005592-DNA_G03sample info indicates NA, genome indicates male; per chrom homozygosity is at least 4 standard deviations below the mean Y 2
SS6004471 per chrom ti/tv is at least 4 standard deviations below the mean for 9; per chrom variant count is at least 3 standard deviations below the platform mean for 9 2
LP6005441-DNA_A04per chrom homozygosity is at least 4 standard deviations above the mean 12,14 1
LP6005441-DNA_A05per chrom homozygosity is at least 4 standard deviations above the mean 19 1
LP6005441-DNA_A10per chrom ti/tv is at least 4 standard deviations below the mean for X 1
LP6005441-DNA_B03per chrom variant count is at least 3 standard deviations below the platform mean for 12 1
LP6005441-DNA_D01per chrom homozygosity is at least 4 standard deviations above the mean 15 1
LP6005441-DNA_F12per chrom homozygosity is at least 4 standard deviations above the mean 15 1
LP6005441-DNA_G06per chrom homozygosity is at least 4 standard deviations above the mean 18,22 1
LP6005441-DNA_G10per chrom ti/tv is at least 4 standard deviations below the mean for X 1
LP6005441-DNA_H06per chrom homozygosity is at least 4 standard deviations above the mean 8 1
LP6005442-DNA_C10per chrom variant count is at least 3 standard deviations below the platform mean for X 1
LP6005442-DNA_D01per chrom ti/tv is at least 4 standard deviations below the mean for X 1
LP6005442-DNA_E04per chrom ti/tv is at least 4 standard deviations above the mean for 4 1
LP6005442-DNA_F01per genome missingness rate is at least 3 standard deviations above the platform mean 1
LP6005442-DNA_F04per chrom homozygosity is at least 4 standard deviations above the mean 6 1
LP6005443-DNA_A06per chrom homozygosity is at least 4 standard deviations below the mean MT 1
LP6005443-DNA_B06per chrom homozygosity is at least 4 standard deviations below the mean MT 1
LP6005443-DNA_C01per chrom variant count is at least 3 standard deviations below the platform mean for X 1
LP6005443-DNA_D03per chrom homozygosity is at least 4 standard deviations above the mean 17 1
LP6005443-DNA_E06per chrom variant count is at least 3 standard deviations below the platform mean for X 1
LP6005443-DNA_F04per chrom variant count is at least 3 standard deviations below the platform mean for X 1
LP6005519-DNA_C04sample info indicates NA, genome indicates male 1
LP6005519-DNA_C05sample info indicates NA, genome indicates male 1
LP6005519-DNA_D03sample info indicates NA, genome indicates male 1
LP6005519-DNA_D04sample info indicates NA, genome indicates male 1
LP6005519-DNA_D05sample info indicates NA, genome indicates male 1
LP6005519-DNA_E04sample info indicates NA, genome indicates male 1
LP6005519-DNA_E05sample info indicates NA, genome indicates male 1
LP6005519-DNA_E06sample info indicates NA, genome indicates female 1
LP6005519-DNA_F03sample info indicates NA, genome indicates male 1
LP6005519-DNA_F04sample info indicates NA, genome indicates male 1
LP6005519-DNA_G02sample info indicates NA, genome indicates female 1
LP6005519-DNA_G03sample info indicates NA, genome indicates male 1
LP6005519-DNA_G04sample info indicates NA, genome indicates male 1
LP6005519-DNA_H03sample info indicates NA, genome indicates male 1
LP6005519-DNA_H04sample info indicates NA, genome indicates male 1
LP6005592-DNA_A02per chrom variant count is at least 3 standard deviations below the platform mean for X1
LP6005592-DNA_B01sample info indicates U, genome indicates female 1
LP6005592-DNA_C01sample info indicates U, genome indicates female 1
LP6005592-DNA_C05sample info indicates NA, genome indicates female 1
LP6005592-DNA_D01sample info indicates U, genome indicates male 1
LP6005592-DNA_D04sample info indicates NA, genome indicates male 1
LP6005592-DNA_H03per chrom ti/tv is at least 4 standard deviations below the mean for 2,4,X 1
LP6005619-DNA_B01sample info indicates NA, genome indicates male 1
LP6005619-DNA_C01sample info indicates NA, genome indicates male 1
LP6005677-DNA_B01sample info indicates NA, genome indicates female 1
LP6005677-DNA_D01sample info indicates NA, genome indicates male 1
LP6005677-DNA_D03sample info indicates NA, genome indicates female 1
LP6005677-DNA_G01sample info indicates NA, genome indicates male 1
SS6004478 singleton count at least 3 standard deviations above the ancestry and platform mean 1
SS6004479 sample info indicates NA, genome indicates female 1

In [56]:
write_csv(problem_summary, path = 'problem_summary.csv')

Provenance


In [57]:
devtools::session_info()


─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       Debian GNU/Linux 9 (stretch)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2019-12-03                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [2] CRAN (R 3.6.1)
 backports     1.1.4   2019-04-10 [2] CRAN (R 3.6.1)
 base64enc     0.1-3   2015-07-28 [2] CRAN (R 3.6.1)
 bigrquery   * 1.2.0   2019-07-02 [1] CRAN (R 3.6.1)
 bit           1.1-14  2018-05-29 [2] CRAN (R 3.6.1)
 bit64         0.9-7   2017-05-08 [2] CRAN (R 3.6.1)
 broom         0.5.2   2019-04-07 [1] CRAN (R 3.6.1)
 callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.6.1)
 colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.6.1)
 curl          4.2     2019-09-24 [1] CRAN (R 3.6.1)
 DBI           1.0.0   2018-05-02 [2] CRAN (R 3.6.1)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.1)
 devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
 digest        0.6.20  2019-07-04 [2] CRAN (R 3.6.1)
 dplyr       * 0.8.3   2019-07-04 [2] CRAN (R 3.6.1)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
 evaluate      0.14    2019-05-28 [2] CRAN (R 3.6.1)
 forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.6.1)
 fs            1.3.1   2019-05-06 [2] CRAN (R 3.6.1)
 gargle        0.3.1   2019-07-26 [1] CRAN (R 3.6.1)
 generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.1)
 ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.1)
 glue          1.3.1   2019-03-12 [2] CRAN (R 3.6.1)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.1)
 haven         2.1.1   2019-07-04 [1] CRAN (R 3.6.1)
 hms           0.5.1   2019-08-23 [2] CRAN (R 3.6.1)
 htmltools     0.3.6   2017-04-28 [2] CRAN (R 3.6.1)
 httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)
 IRdisplay     0.7.0   2018-11-29 [2] CRAN (R 3.6.1)
 IRkernel      1.0.2   2019-07-12 [2] CRAN (R 3.6.1)
 jsonlite      1.6     2018-12-07 [2] CRAN (R 3.6.1)
 labeling      0.3     2014-08-23 [1] CRAN (R 3.6.1)
 lattice       0.20-38 2018-11-04 [4] CRAN (R 3.6.1)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.1)
 lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.6.1)
 memoise       1.1.0   2017-04-21 [2] CRAN (R 3.6.1)
 modelr        0.1.5   2019-08-08 [1] CRAN (R 3.6.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.1)
 nlme          3.1-140 2019-05-12 [4] CRAN (R 3.6.1)
 pbdZMQ        0.3-3   2018-05-05 [2] CRAN (R 3.6.1)
 pillar        1.4.2   2019-06-29 [2] CRAN (R 3.6.1)
 pkgbuild      1.0.5   2019-08-26 [1] CRAN (R 3.6.1)
 pkgconfig     2.0.2   2018-08-16 [2] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
 plyr          1.8.4   2016-06-08 [1] CRAN (R 3.6.1)
 prettyunits   1.0.2   2015-07-13 [2] CRAN (R 3.6.1)
 processx      3.4.1   2019-07-18 [2] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [2] CRAN (R 3.6.1)
 purrr       * 0.3.2   2019-03-15 [2] CRAN (R 3.6.1)
 R6            2.4.0   2019-02-14 [2] CRAN (R 3.6.1)
 Rcpp          1.0.2   2019-07-25 [2] CRAN (R 3.6.1)
 readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.1)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
 repr          1.0.1   2019-05-14 [2] CRAN (R 3.6.1)
 reshape2      1.4.3   2017-12-11 [1] CRAN (R 3.6.1)
 rlang         0.4.0   2019-06-25 [2] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [2] CRAN (R 3.6.1)
 rstudioapi    0.10    2019-03-19 [2] CRAN (R 3.6.1)
 rvest         0.3.4   2019-05-15 [1] CRAN (R 3.6.1)
 scales      * 1.0.0   2018-08-09 [1] CRAN (R 3.6.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
 skimr       * 1.0.7   2019-06-20 [1] CRAN (R 3.6.1)
 stringi       1.4.3   2019-03-12 [2] CRAN (R 3.6.1)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.1)
 testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.1)
 tibble      * 2.1.3   2019-06-06 [2] CRAN (R 3.6.1)
 tidyr       * 0.8.3   2019-03-01 [2] CRAN (R 3.6.1)
 tidyselect    0.2.5   2018-10-11 [2] CRAN (R 3.6.1)
 tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 uuid          0.1-2   2015-07-28 [2] CRAN (R 3.6.1)
 vctrs         0.2.0   2019-07-05 [2] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.6.1)
 xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.1)
 zeallot       0.1.0   2018-01-28 [2] CRAN (R 3.6.1)

[1] /home/jupyter/.R/library
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

Copyright 2019 Verily Life Sciences LLC. All rights reserved.

Licensed under the Apache License, Version 2.0 (the 'License'); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an 'AS IS' BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.