As Illumina sequencing becoming more and more popular, there are various sequencing centers available for sequencing services. Here at UT-Austin, we have a renowned sequencing center (UT-Austin GSAF) that has been greatly helpful to my work. However, recently we have had good experience with MD Anderson Science Park Next-Generation Sequencing (NGS) Facility. So in this post, I am going to compare the sequencing price between these two centers for future reference.

Since the two sequencing centers have their sequencing prices posted online, I will do some web scraping and data cleaning to visualize the comparisons.



In [263]:

    
%matplotlib inline

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sequencing_tools.viz_tools import okabeito_palette

Science park price table

For MD Anderson Science park price, I will download the data from their website.



In [264]:

    
mda = 'https://www.mdanderson.org/research/research-resources/core-facilities/next-generation-sequencing-core/services.html'
mda_html = urllib.request.urlopen(mda)
soup = BeautifulSoup(mda_html, 'lxml')
tabs = soup.find_all('table')
table = tabs[-1]
mda_df = pd.read_html(str(table), 
                      flavor='bs4',
                     header=0,
                     index_col = 0)[0]
mda_df









    Out[264]:







  
    
      
      NGS User Group Member Price (per lane)
      MDACC Faculty, UT System, BCM w/ MOU Price (per lane)
      *External Out-of-Network Price (per lane)
    
    
      Service
      
      
      
    
  
  
    
      HiSeq 3000
      NaN
      NaN
      NaN
    
    
      50 bp, single read
      $966.41
      $1,100.71
      $2,177.47
    
    
      75 bp, paired end
      $1,746.41
      $1,880.71
      $3,425.47
    
    
      100 bp, paired end
      $2,089.61
      $2,223.91
      $3,974.59
    
    
      150 bp, paired end
      $2,431.25
      $2,565.55
      $4,521.22
    
    
      NextSeq 500
      NaN
      NaN
      NaN
    
    
      High Output:
      NaN
      NaN
      NaN
    
    
      75 bp, single read
      $1,632.95
      $1,853.96
      $3,238.11
    
    
      75 bp, paired end
      $3,085.83
      $3,306.84
      $5,562.72
    
    
      150 bp, paired end
      $4,904.79
      $5,125.80
      $8,473.06
    
    
      Mid Output:
      NaN
      NaN
      NaN
    
    
      75 bp, paired end
      $1,232.55
      $1,453.56
      $2,597.47
    
    
      150 bp, paired end
      $1,941.83
      $2,162.84
      $3,732.32
    
    
      MiSeq
      NaN
      NaN
      NaN
    
    
      50 bp, single read (v2 chemistry)
      $948.99
      $1,132.53
      $1,358.78
    
    
      150 bp, paired end (v2 chemistry)
      $1,202.19
      $1,481.73
      $1,707.98
    
    
      300 bp, single read (v2 chemistry)
      $1,202.19
      $1,481.73
      $1,707.98
    
    
      300 bp, paired end (v3 chemistry)
      $1,782.99
      $2,177.73
      $2,403.98

Now, I need to clean up the table to make it more easily to manipulate. I will need a function to assign the platform (sequencing machines) for each run type (X bp, paired|single end). Then, I will only look at the member price, since that's what we pay for.



In [265]:

    
def clean_mda_index(idx, data_col):
    platform = ''
    clear = False
    new_idx = []
    for i, dc in zip(idx,data_col):
        if pd.isnull(dc):
            if clear:
                platform = ''
                clear=False
            platform = platform + i + '_'
            new_idx.append(platform)
        else:
            new_idx.append(platform + i)
            clear = True
    return new_idx



In [266]:

    
mda_price_df = mda_df \
    .assign(seq_type = lambda d: clean_mda_index(d.index, d.iloc[:,0])) \
    .assign(seq_type = lambda d: np.where(d.seq_type.str.contains('Mid Output'),
                                         'NextSeq 500_' + d.seq_type,
                                         d.seq_type))\
    .reset_index() \
    .drop('Service', axis=1) \
    .pipe(lambda d: d[~pd.isnull(d.iloc[:,1])]) \
    .assign(platform = lambda d: d.seq_type.str.split('_', expand=True).iloc[:,0]) \
    .assign(basepair = lambda d: d.seq_type.str.extract('_([0-9]+)', expand=False).astype(int)) \
    .assign(ends = lambda d: d.seq_type.str.extract('(single|paired)', expand=False)) \
    .pipe(lambda d: d[~d.seq_type.str.contains('Mid')]) \
    .assign(md_price = lambda d: d.iloc[:,0].str.replace('[,$]','').astype(float))\
    .assign(machine = lambda d: d.platform.str.replace(' [0-9]+','')) \
    .assign(total_base = lambda d: np.where(d.ends == "paired", d.basepair * 2, d.basepair))\
    .assign(ends = lambda d: np.where(d.machine=="MiSeq", 'single', d.ends))\
    .filter(regex = 'platform|total|ends|md_price|machine')  \
    .drop_duplicates()
mda_price_df









    Out[266]:







  
    
      
      platform
      ends
      md_price
      machine
      total_base
    
  
  
    
      1
      HiSeq 3000
      single
      966.41
      HiSeq
      50
    
    
      2
      HiSeq 3000
      paired
      1746.41
      HiSeq
      150
    
    
      3
      HiSeq 3000
      paired
      2089.61
      HiSeq
      200
    
    
      4
      HiSeq 3000
      paired
      2431.25
      HiSeq
      300
    
    
      7
      NextSeq 500
      single
      1632.95
      NextSeq
      75
    
    
      8
      NextSeq 500
      paired
      3085.83
      NextSeq
      150
    
    
      9
      NextSeq 500
      paired
      4904.79
      NextSeq
      300
    
    
      14
      MiSeq
      single
      948.99
      MiSeq
      50
    
    
      15
      MiSeq
      single
      1202.19
      MiSeq
      300
    
    
      17
      MiSeq
      single
      1782.99
      MiSeq
      600

UT GSAF price table

For UT GSAF price, we will download the table from UT GSAF website.



In [267]:

    
gsaf = 'https://wikis.utexas.edu/display/GSAF/Library+Prep+and+NGS+Pricing+Descriptions'
gsaf_html = urllib.request.urlopen(gsaf)
soup = BeautifulSoup(gsaf_html, 'lxml')
tabs = soup.find_all('table')
gsaf_df = pd.read_html(tabs[0].prettify(), 
             flavor='bs4',
             index_col = None,
            header=0)[0] 
gsaf_df









    Out[267]:







  
    
      
      Platform
      Run Type
      Internal / UT
      External Academic
      External Commercial
    
  
  
    
      0
      HiSeq 2500
      SR 50
      $1052
      $1331
      $1357
    
    
      1
      HiSeq 2500
      SR 100
      $1428
      $1806
      $1842
    
    
      2
      HiSeq 2500
      PE 125
      $2520
      $3187
      $3250
    
    
      3
      HiSeq 4000
      SR 50  (1)
      $ 1,043
      $ 1,319
      $1,345
    
    
      4
      HiSeq 4000
      PE 150  (1)
      $ 2,562
      $ 3,239
      $ 3,266
    
    
      5
      HiSeq 4000
      SR 50
      $ 1,697
      $ 2,146
      $ 2,173
    
    
      6
      HiSeq 4000
      PE 75
      $ 14,494
      $ 18,327
      $ 18,531
    
    
      7
      NextSeq 500
      SR 75 H.O  (1,3)
      $ 2,302
      $ 2,906
      $ 3,060
    
    
      8
      NextSeq 500
      PE 75 H.O  . (1,3)
      $ 3,826
      $ 4,834
      $ 4,988
    
    
      9
      NextSeq 500
      PE 150 H.O.
      $ 5,735
      $ 7,250
      $ 7,404
    
    
      10
      MiSeq
      V2 - 300 cycles
      $ 1,627
      $ 2,054
      $ 2,157
    
    
      11
      MiSeq
      V2 - 500 cycles  (1)
      $ 1,771
      $ 2,236
      $ 2,339
    
    
      12
      MiSeq
      V3 - 150 cycles
      $ 1,463
      $ 1,847
      $ 1,950
    
    
      13
      MiSeq
      V3 - 600 cycles  (1)
      $ 2,183
      $ 2,758
      $ 2,860

The resulting table is much more easy to be interpreted, since it is in a clean format. Again, I will only look at the Internal price.



In [268]:

    
gsaf_price_df = gsaf_df\
    .iloc[:, :3] \
    .assign(machine = lambda d: d.Platform.str.replace(' [0-9]+','')) \
    .assign(basepair = lambda d: d['Run Type'].str.extract(' ([0-9]+)', expand=False).astype(int)) \
    .assign(ends = lambda d: np.where(d['Run Type'].str.contains('PE'), 'paired', 'single')) \
    .assign(total_base = lambda d: np.where(d.ends == "paired", 
                                      d.basepair * 2, 
                                      d.basepair))\
    .assign(gsaf_price = lambda d: d['Internal / UT'].str.replace('[$.,]','').astype(float)) \
    .filter(regex = 'total|price|machine|ends|Platform')
gsaf_price_df









    Out[268]:







  
    
      
      Platform
      machine
      ends
      total_base
      gsaf_price
    
  
  
    
      0
      HiSeq 2500
      HiSeq
      single
      50
      1052.0
    
    
      1
      HiSeq 2500
      HiSeq
      single
      100
      1428.0
    
    
      2
      HiSeq 2500
      HiSeq
      paired
      250
      2520.0
    
    
      3
      HiSeq 4000
      HiSeq
      single
      50
      1043.0
    
    
      4
      HiSeq 4000
      HiSeq
      paired
      300
      2562.0
    
    
      5
      HiSeq 4000
      HiSeq
      single
      50
      1697.0
    
    
      6
      HiSeq 4000
      HiSeq
      paired
      150
      14494.0
    
    
      7
      NextSeq 500
      NextSeq
      single
      75
      2302.0
    
    
      8
      NextSeq 500
      NextSeq
      paired
      150
      3826.0
    
    
      9
      NextSeq 500
      NextSeq
      paired
      300
      5735.0
    
    
      10
      MiSeq
      MiSeq
      single
      300
      1627.0
    
    
      11
      MiSeq
      MiSeq
      single
      500
      1771.0
    
    
      12
      MiSeq
      MiSeq
      single
      150
      1463.0
    
    
      13
      MiSeq
      MiSeq
      single
      600
      2183.0

I notice there's duplicate of HiSeq 4000 single end 50-nt price (1043 USD vs 1697.0 USD), so I will take the lowest price, and merge with the science park price.



In [269]:

    
merge_df = gsaf_price_df\
    .groupby(['Platform','machine','ends','total_base'], as_index=False)\
    .agg({'gsaf_price':np.min})\
    .merge(mda_price_df, how ='outer')
merge_df









    Out[269]:







  
    
      
      Platform
      machine
      ends
      total_base
      gsaf_price
      platform
      md_price
    
  
  
    
      0
      HiSeq 2500
      HiSeq
      paired
      250
      2520.0
      NaN
      NaN
    
    
      1
      HiSeq 2500
      HiSeq
      single
      50
      1052.0
      HiSeq 3000
      966.41
    
    
      2
      HiSeq 4000
      HiSeq
      single
      50
      1043.0
      HiSeq 3000
      966.41
    
    
      3
      HiSeq 2500
      HiSeq
      single
      100
      1428.0
      NaN
      NaN
    
    
      4
      HiSeq 4000
      HiSeq
      paired
      150
      14494.0
      HiSeq 3000
      1746.41
    
    
      5
      HiSeq 4000
      HiSeq
      paired
      300
      2562.0
      HiSeq 3000
      2431.25
    
    
      6
      MiSeq
      MiSeq
      single
      150
      1463.0
      NaN
      NaN
    
    
      7
      MiSeq
      MiSeq
      single
      300
      1627.0
      MiSeq
      1202.19
    
    
      8
      MiSeq
      MiSeq
      single
      500
      1771.0
      NaN
      NaN
    
    
      9
      MiSeq
      MiSeq
      single
      600
      2183.0
      MiSeq
      1782.99
    
    
      10
      NextSeq 500
      NextSeq
      paired
      150
      3826.0
      NextSeq 500
      3085.83
    
    
      11
      NextSeq 500
      NextSeq
      paired
      300
      5735.0
      NextSeq 500
      4904.79
    
    
      12
      NextSeq 500
      NextSeq
      single
      75
      2302.0
      NextSeq 500
      1632.95
    
    
      13
      NaN
      HiSeq
      paired
      200
      NaN
      HiSeq 3000
      2089.61
    
    
      14
      NaN
      MiSeq
      single
      50
      NaN
      MiSeq
      948.99

Let's see if prices from both centers are correlated.



In [270]:

    
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.rc('axes',labelsize=15)
p = sns.FacetGrid(data = merge_df,
             hue = 'machine',
             size = 5)
p.map(plt.scatter, 'md_price','gsaf_price')
p.add_legend(title = '')
p.set(xlabel = 'MD Anderson Science park\nsequencing cost (USD)',
     ylabel = 'UT-Austin GSAF\nsequencing cost (USD)')

# label outlier
for i, row in merge_df.iterrows():
    if row['gsaf_price'] > 10000:
        label = row['machine'] + ' ' + str(row['total_base']) + ' bp'
        p.fig.axes[0].text(row['md_price'] + 200,
                           row['gsaf_price'],
                           label)

p.fig.axes[0].plot(range(5000), color = 'black')









    Out[270]:





[<matplotlib.lines.Line2D at 0x11cd434a8>]

We can see that sequencing prices from the both centers are proportion to each other, with NextSeq runs tend to be of higher costs.

There is an HiSeq run being an outlier in the plot (~14000 USD at UT-Austin GSAF vs ~2000 USD at MDA Smithville), it is possibly mis-labeled of a full run (8 lanes for HiSeq) price instead of a per-lane-price. For the next analysis, I will assume that's the case.

I am interested at the head-to-head comparisons between the two sequencing centers for different sequencing run types,



In [271]:

    
merge_df = gsaf_price_df\
    .groupby(['Platform','machine','ends','total_base'], as_index=False)\
    .agg({'gsaf_price':np.min})\
    .merge(mda_price_df, how ='inner') \
    .filter(regex = 'price|machine|ends|base')\
    .pipe(pd.melt, id_vars = ['machine','ends','total_base'],
                 var_name = 'center',
                 value_name = 'price')\
    .assign(center = lambda d: np.where(d['center'].str.contains('gsaf'),
                                        'UT-Austin GSAF',
                                        'MD Anderson Science park')) \
    .assign(price = lambda d: np.where(d.price > 10000, d.price / 8, d.price))

p = sns.FacetGrid(data = merge_df,
             col = 'machine',
             sharex=False,
             size = 5)
p.map(sns.barplot, 'total_base',
                  'price',
                  'center',
                  palette = okabeito_palette(),
                  hue_order = ['MD Anderson Science park',
                              'UT-Austin GSAF'])
p.add_legend(bbox_to_anchor=(0.5,0.8),
            fontsize=12)
p.set(xlabel = '',
     ylabel = '')
p.set_titles('{col_name}')
p.fig.text(0.4, 0, 'Sequencing cycle (nt)', fontsize = 15)
p.fig.text(0, 0.8, 'Per-lane-price (USD)', fontsize = 15, rotation=90)









    



/Users/wckdouglas/miniconda3/lib/python3.6/site-packages/seaborn/axisgrid.py:703: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)






    Out[271]:





Text(0,0.8,'Per-lane-price (USD)')

So it looks like that UT Austin GSAF has a higher price for every sequencing type.

For the comparison in HiSeq runs, GSAF is running HiSeq 4000 while MD Anderson Science Park is running HiSeq 3000, so that maybe one of the reason. For the other ones, seems like the machines are the same.

	NGS User Group Member Price (per lane)	MDACC Faculty, UT System, BCM w/ MOU Price (per lane)	*External Out-of-Network Price (per lane)
Service
HiSeq 3000	NaN	NaN	NaN
50 bp, single read	$966.41	$1,100.71	$2,177.47
75 bp, paired end	$1,746.41	$1,880.71	$3,425.47
100 bp, paired end	$2,089.61	$2,223.91	$3,974.59
150 bp, paired end	$2,431.25	$2,565.55	$4,521.22
NextSeq 500	NaN	NaN	NaN
High Output:	NaN	NaN	NaN
75 bp, single read	$1,632.95	$1,853.96	$3,238.11
75 bp, paired end	$3,085.83	$3,306.84	$5,562.72
150 bp, paired end	$4,904.79	$5,125.80	$8,473.06
Mid Output:	NaN	NaN	NaN
75 bp, paired end	$1,232.55	$1,453.56	$2,597.47
150 bp, paired end	$1,941.83	$2,162.84	$3,732.32
MiSeq	NaN	NaN	NaN
50 bp, single read (v2 chemistry)	$948.99	$1,132.53	$1,358.78
150 bp, paired end (v2 chemistry)	$1,202.19	$1,481.73	$1,707.98
300 bp, single read (v2 chemistry)	$1,202.19	$1,481.73	$1,707.98
300 bp, paired end (v3 chemistry)	$1,782.99	$2,177.73	$2,403.98

	platform	ends	md_price	machine	total_base
1	HiSeq 3000	single	966.41	HiSeq	50
2	HiSeq 3000	paired	1746.41	HiSeq	150
3	HiSeq 3000	paired	2089.61	HiSeq	200
4	HiSeq 3000	paired	2431.25	HiSeq	300
7	NextSeq 500	single	1632.95	NextSeq	75
8	NextSeq 500	paired	3085.83	NextSeq	150
9	NextSeq 500	paired	4904.79	NextSeq	300
14	MiSeq	single	948.99	MiSeq	50
15	MiSeq	single	1202.19	MiSeq	300
17	MiSeq	single	1782.99	MiSeq	600

	Platform	Run Type	Internal / UT	External Academic	External Commercial
0	HiSeq 2500	SR 50	$1052	$1331	$1357
1	HiSeq 2500	SR 100	$1428	$1806	$1842
2	HiSeq 2500	PE 125	$2520	$3187	$3250
3	HiSeq 4000	SR 50 (1)	$ 1,043	$ 1,319	$1,345
4	HiSeq 4000	PE 150 (1)	$ 2,562	$ 3,239	$ 3,266
5	HiSeq 4000	SR 50	$ 1,697	$ 2,146	$ 2,173
6	HiSeq 4000	PE 75	$ 14,494	$ 18,327	$ 18,531
7	NextSeq 500	SR 75 H.O (1,3)	$ 2,302	$ 2,906	$ 3,060
8	NextSeq 500	PE 75 H.O . (1,3)	$ 3,826	$ 4,834	$ 4,988
9	NextSeq 500	PE 150 H.O.	$ 5,735	$ 7,250	$ 7,404
10	MiSeq	V2 - 300 cycles	$ 1,627	$ 2,054	$ 2,157
11	MiSeq	V2 - 500 cycles (1)	$ 1,771	$ 2,236	$ 2,339
12	MiSeq	V3 - 150 cycles	$ 1,463	$ 1,847	$ 1,950
13	MiSeq	V3 - 600 cycles (1)	$ 2,183	$ 2,758	$ 2,860