Past Performance Information Retrieval System (PPIRS) -> review - not public data
Are there any other datasets we should be considering?
Table gpqueries:contracts.raw contains the unmodified data from the USASpending.gov archives. It is constructed from the <year>_All_Contracts_Full_20160515.csv.zip files and includes contracts from 2000 to May 15, 2016.
The table contains 45M rows and 225 columns. Each row refers to a transaction (a purchase or refund) made by a federal agency. It may be a pizza or an airplane.
The columns are grouped into categories:
unique_transaction_id – baseandalloptionsvalue
maj_agency_cat – fundedbyforeignentity
signeddate – lastdatetoorder, last_modified_date
contractactiontype – programacronym
vendorname – statecode
PlaceofPerformanceCity – placeofperformancecongressionaldistrict
psc_cat – manufacturingorganizationtype
agencyid – idvmodificationnumber
solicitationid – statutoryexceptiontofairopportunity
organizationaltype – otherstatutoryauthority
prime_awardee_executive1 – interagencycontractingauthority
A detailed description of each variable is available in the official codebook.
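To get a quick feel for the table before running the analyses below, a peek query along these lines works well (a sketch, mirroring the legacy-SQL style of the gpqueries:contracts.raw queries later in this notebook; the selected columns are just a small subset):
%%sql
select
  unique_transaction_id, signeddate, agencyid, vendorname,
  productorservicecode, dollarsobligated
from
  gpqueries:contracts.raw
limit 3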
In [1]:
import google.datalab.bigquery as bq
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn import cross_validation as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
In [2]:
%%sql
select * from [fiery-set-171213:vrec.sam_exclusions] limit 5
Out[2]:
In [4]:
%%sql
select Exclusion_Type from [fiery-set-171213:vrec.sam_exclusions] group by 1;
Out[4]:
In [5]:
%%sql
select Classification from [fiery-set-171213:vrec.sam_exclusions] group by 1;
Out[5]:
In [6]:
%%sql
select
count(*)
from [fiery-set-171213:vrec.sam_exclusions]
where Classification in ('Firm')
;
Out[6]:
There are 8,659 firms on the SAM exclusion list.
In [7]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM Active_Date) as year,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
and Active_Date is not NULL
group by 1
order by 1;
In [8]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.plot(kind='bar', x='year', title='Excluded Firms per year', figsize=(15,8))
ax.set_xlabel('Year')
ax.set_ylabel('count')
Out[8]:
In [9]:
%%bq query
select
#Name,
SAM_Number,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and Active_Date is not NULL
group by 1
order by 2 DESC
limit 5;
Out[9]:
In [10]:
%%bq query
select
NPI,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and CAGE is not NULL
group by 1
order by 2 DESC
limit 5;
Out[10]:
In [11]:
%%bq query
select
CAGE,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and CAGE is not NULL
group by 1
order by 2 DESC
limit 5;
Out[11]:
NPI and CAGE don't seem to be great keys for joining the data; ideally we can use the SAM number.
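As a quick sanity check on SAM_Number as a join key, a sketch that counts rows, distinct values, and nulls for the excluded firms (assuming the column is populated the way the samples above suggest):
%%bq query
select
  count(*) as n_rows,
  count(distinct SAM_Number) as distinct_sam,
  sum(case when SAM_Number is null then 1 else 0 end) as null_sam
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')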
In [12]:
%%bq query
select *
from `fiery-set-171213.vrec.fapiis`
limit 5
Out[12]:
In [13]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM RECORD_DATE) as year,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 1;
In [14]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.plot(kind='bar', x='year', title='Firms by Record date', figsize=(10,5))
ax.set_xlabel('Year')
ax.set_ylabel('count')
Out[14]:
In [15]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM TERMINATION_DATE) as year,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 1;
In [16]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.plot(kind='bar', x='year', title='Firms by termination date', figsize=(10,5))
ax.set_xlabel('Year')
ax.set_ylabel('count')
Out[16]:
In [17]:
%%bq query
select
AWARDEE_NAME,
DUNS,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1,2
order by 3 DESC
limit 5;
Out[17]:
In [18]:
%%bq query
select
*
from `fiery-set-171213.vrec.fapiis`
where AWARDEE_NAME in ('ALPHA RAPID ENGINEERING SOLUTIONS')
limit 5;
Out[18]:
In [19]:
%%bq query
select
RECORD_TYPE,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 2 DESC
Out[19]:
FAPIIS is not bad, with 3,002 DUNS codes, but its time range only covers 2012 to 2017.
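A sketch of how the overlap with the spending data could be measured, assuming usa_spending_all carries a dunsnumber column (the USASpending dumps do) and that both DUNS columns are stored as the same type (cast if needed):
%%bq query
select
  count(distinct f.DUNS) as fapiis_duns_seen_in_spending
from `fiery-set-171213.vrec.fapiis` f
join `fiery-set-171213.vrec.usa_spending_all` s
  on f.DUNS = s.dunsnumber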
In [3]:
%%bq query -n df_query
select count(*) as transactions
from `fiery-set-171213.vrec.usa_spending_all`
In [4]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()
Out[4]:
In [22]:
%%bq query
select *
from `fiery-set-171213.vrec.usa_spending_all`
where mod_agency in ('1700: DEPT OF THE NAVY')
limit 5
Out[22]:
In [23]:
%%bq query -n df_query
select
#substr(signeddate, 1, 2) month,
fiscal_year as year,
count(*) transactions,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by year
order by year asc
In [24]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.set_index('year')['dollarsobligated'].plot(kind='bar', title='Government purchases by years')
ax.set_ylabel('dollars obligated')
Out[24]:
In [25]:
%%bq query -n df_query
select
fiscal_year as year,
sum(dollarsobligated)/count(*) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by year
order by year asc
In [26]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.set_index('year')['dollarsobligated'].plot(kind='bar', title='avg. transaction size by years')
ax.set_ylabel('dollars obligated')
Out[26]:
This means we're dealing with 49.5M transactions totaling 6.7 trillion dollars. These purchases came from 622k vendors that won 2.2mn solicitations issued by government agencies.
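These headline numbers can be reproduced directly from the table; a sketch, assuming the dump carries dunsnumber and solicitationid columns under those exact names:
%%bq query
select
  count(*) as transactions,
  sum(dollarsobligated) as total_dollars,
  count(distinct dunsnumber) as vendors,
  count(distinct solicitationid) as solicitations
from `fiery-set-171213.vrec.usa_spending_all`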
In [27]:
%%bq query
select
maj_agency_cat,
mod_agency,
count(*)
from `fiery-set-171213.vrec.usa_spending_all`
group by 1,2
order by 3 DESC
limit 20
Out[27]:
In [28]:
%%bq query
select
mod_parent,
vendorname,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by 1,2
order by 3 DESC
limit 20
Out[28]:
In [29]:
%%bq query
select
productorservicecode,
systemequipmentcode,
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorname in ('LOCKHEED MARTIN CORPORATION')
group by 1,2,3,4
order by 5 DESC
limit 20
Out[29]:
In [30]:
%%bq query
select
#mod_parent,
vendorname,
systemequipmentcode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
group by 1,2
order by 3 DESC
limit 20
Out[30]:
In [32]:
%%bq query
select
vendorname,
systemequipmentcode,
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
and contractingofficerbusinesssizedetermination in ('S: SMALL BUSINESS')
group by 1,2,3,4
order by dollarsobligated DESC
limit 20
Out[32]:
In [190]:
%%bq query
select
*
from `gpqueries.contracts.raw`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
and contractingofficerbusinesssizedetermination in ('S: SMALL BUSINESS')
limit 1
Out[190]:
In [33]:
%%bq query
select
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1,2
order by dollarsobligated DESC
limit 10
Out[33]:
In [34]:
%%bq query -n df_query
select
fiscal_year,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1
order by 1
In [35]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
plt = df.set_index('fiscal_year')['dollarsobligated'].plot(kind='bar', title='transactions amount for SMBs')
In [36]:
%%bq query -n df_query
#%%sql
select
smb.fiscal_year,
sum(smb.transaction) as smb,
sum(total.transaction) as total,
sum(smb.transaction)/sum(total.transaction) as percentage
from
(select
fiscal_year,
sum(dollarsobligated) as transaction
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1) as smb
join
(select
fiscal_year,
sum(dollarsobligated) as transaction
from `fiery-set-171213.vrec.usa_spending_all`
group by 1) as total
on smb.fiscal_year = total.fiscal_year
group by 1
order by 1
In [37]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
plt = df.set_index('fiscal_year')['percentage'].plot(kind='bar', title='dollars % for SMBs')
In [38]:
%%bq query
select
smb.principalnaicscode as principalnaicscode,
sum(total.count) as count,
sum(smb.dollarsobligated) as dollarsobligated_smb,
sum(total.dollarsobligated) as dollarsobligated_total,
sum(smb.dollarsobligated)/sum(total.dollarsobligated) as smb_percentage
from
(select
principalnaicscode,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1) as smb
join
(select
principalnaicscode,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by 1
having dollarsobligated > 0
) as total
on smb.principalnaicscode = total.principalnaicscode
group by 1
order by 5 DESC
limit 10
Out[38]:
In [176]:
%%bq query -n df_query
select
maj_agency_cat,
#mod_agency,
#contractactiontype,
#typeofcontractpricing,
#performancebasedservicecontract,
state,
#vendorcountrycode,
#principalnaicscode,
contractingofficerbusinesssizedetermination,
#sum(dollarsobligated) as dollarsobligated
dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and dollarsobligated > 0
#group by 1,2,3
limit 20000
In [177]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()
Out[177]:
In [178]:
# Create dummy variables using the pandas function get_dummies
df1 = df.join(pd.get_dummies(df['maj_agency_cat']))
df1 = df1.join(pd.get_dummies(df['state']))
df1 = df1.join(pd.get_dummies(df['contractingofficerbusinesssizedetermination']))
df1 = df1.drop('maj_agency_cat', axis = 1)
df1 = df1.drop('state', axis = 1)
df1 = df1.drop('contractingofficerbusinesssizedetermination', axis = 1)
df1.head()
Out[178]:
In [179]:
train_data = df1.iloc[:,1:]
train_labels = df[['dollarsobligated']]
lm = LinearRegression()
lm.fit(train_data, train_labels)
# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% np.mean((lm.predict(train_data) - train_labels) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(train_data, train_labels))
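The MSE and variance score above are computed on the training data itself. A minimal hold-out check, reusing the cross_validation module imported at the top, gives a less optimistic estimate (a sketch; the random_state is arbitrary):
X_train, X_test, y_train, y_test = cv.train_test_split(
    train_data, train_labels, test_size=0.25, random_state=0)
lm_holdout = LinearRegression()
lm_holdout.fit(X_train, y_train)
# Error on the held-out split, not on the data the model was fit to
print("Held-out MSE: %.2f" % mean_squared_error(y_test, lm_holdout.predict(X_test)))
print("Held-out R^2: %.2f" % lm_holdout.score(X_test, y_test))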
In [182]:
%%bq query -n df_query
select
vendorname,
maj_agency_cat,
state,
contractingofficerbusinesssizedetermination,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and dollarsobligated > 0
group by 1,2,3,4
limit 20000
In [183]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()
Out[183]:
In [185]:
# Create dummy variables using the pandas function get_dummies
df1 = df.join(pd.get_dummies(df['maj_agency_cat']))
df1 = df1.join(pd.get_dummies(df['state']))
df1 = df1.join(pd.get_dummies(df['contractingofficerbusinesssizedetermination']))
df1 = df1.drop('maj_agency_cat', axis = 1)
df1 = df1.drop('state', axis = 1)
df1 = df1.drop('contractingofficerbusinesssizedetermination', axis = 1)
df1 = df1.drop('vendorname', axis = 1)
df1 = df1.drop('dollarsobligated', axis = 1)
train_data = df1.iloc[:,1:]
train_labels = df[['count']]
lm = LinearRegression()
lm.fit(train_data, train_labels)
# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% np.mean((lm.predict(train_data) - train_labels) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(train_data, train_labels))
Ranking based on 'count', i.e. the number of contracts that occurred.
TODO: check uppercase vs. lowercase handling in the regex (see the sketch below).
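One way to handle the TODO: regexp_contains uses the RE2 engine, which accepts an inline (?i) flag, so the filter matches regardless of how the codes are capitalized. A sketch:
%%bq query
select
  productorservicecode,
  count(*) as count
from `fiery-set-171213.vrec.usa_spending_all`
where regexp_contains(productorservicecode, '(?i)meat')
group by 1
order by 2 DESC
limit 10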
In [65]:
%%bq query
select
#principalnaicscode,
fiscal_year,
maj_agency_cat,
#contractingofficerbusinesssizedetermination,
#vendorname,
productorservicecode,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
#where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
#where regexp_contains(principalnaicscode, "CONSTRUCTION")
#and regexp_contains(maj_agency_cat, "AGRICULTURE")
where regexp_contains(productorservicecode, "MEAT")
#and fiscal_year = 2016
group by 1,2,3
order by dollarsobligated DESC
limit 10
Out[65]:
In [2]:
%%bq query -n df_query
select
contractingofficerbusinesssizedetermination,
mod_agency,
vendorname,
count(*) as count
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and mod_agency not in ("")
group by 1,2,3
order by count DESC
limit 20000
In [3]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()
Out[3]:
In [4]:
df1 = df.drop('contractingofficerbusinesssizedetermination', axis = 1)
n_agency = df1.mod_agency.unique().shape[0]
n_vendors = df1.vendorname.unique().shape[0]
print('Number of gov agencies = ' + str(n_agency) + ' | Number of vendors = ' + str(n_vendors))
In [5]:
# Convert categorical values with label encoding
le_agency = LabelEncoder()
label_agency = le_agency.fit_transform(df1['mod_agency'])
le_vendor = LabelEncoder()
label_vendor = le_vendor.fit_transform(df1['vendorname'])
df_agency = pd.DataFrame(label_agency)
df_vendor = pd.DataFrame(label_vendor)
df2 = pd.concat([df_agency, df_vendor], axis = 1)
df2 = pd.concat([df2, df1['count']], axis = 1)
df2.columns = ['mod_agency', 'vendorname', 'count']
df2.head(5)
# To get the right label back
# le_agency.inverse_transform([173, 100])
Out[5]:
In [6]:
# Split into training and test data set
train_data, test_data = cv.train_test_split(df2, test_size=0.25)
In [7]:
#Build the matrix
train_data_matrix = np.zeros((n_agency, n_vendors))
for line in train_data.itertuples():
    # line = (index, mod_agency, vendorname, count); the encoded labels are already 0-based
    train_data_matrix[line[1], line[2]] = line[3]
test_data_matrix = np.zeros((n_agency, n_vendors))
for line in test_data.itertuples():
    test_data_matrix[line[1], line[2]] = line[3]
#Compute cosine distance
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # Use np.newaxis so that mean_user_rating has the same shape as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
In [8]:
# Evaluation
def rmse(prediction, ground_truth):
    # Keep only the non-zero entries: we only evaluate on pairs observed in the test set
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))
In [10]:
print('Workflow 1')
print('=' * 100)
print('Select your agency:')
agency = df1['mod_agency'][10]
print(agency)
print('=' * 100)
print('1. Have you considered working with these SMB companies (user prediction)?')
agency = le_agency.transform([agency])[0]
vendor_reco = pd.DataFrame(user_prediction[agency, :])
labels = pd.DataFrame(le_vendor.inverse_transform(range(0, len(vendor_reco))))
df_reco = pd.concat([vendor_reco, labels], axis = 1)
df_reco.columns = ['reco_score', 'vendorname']
#Join to get the SMB list
df_smb = df.drop(['mod_agency', 'count'], axis = 1)
df_reco = df_reco.set_index('vendorname').join(df_smb.set_index('vendorname'))
df_reco = df_reco.sort_values(['reco_score'], ascending = [0])
df_reco[df_reco['contractingofficerbusinesssizedetermination'] == 'S: SMALL BUSINESS'].head(10)
Out[10]:
In [34]:
print('=' * 100)
print('2. Have you considered working with these SMB companies (item-item prediction)?')
vendor_reco = pd.DataFrame(item_prediction[agency, :])
df_reco = pd.concat([vendor_reco, labels], axis = 1)
df_reco.columns = ['reco_score', 'vendorname']
df_reco = df_reco.set_index('vendorname').join(df_smb.set_index('vendorname'))
df_reco = df_reco.sort_values(['reco_score'], ascending = [0])
df_reco[df_reco['contractingofficerbusinesssizedetermination'] == 'S: SMALL BUSINESS'].head(10)
Out[34]:
In [35]:
print('Workflow 2')
print('=' * 100)
print('Select a vendor:')
# Workflow 2 - WIP
# Select a vendor
# Find other, similar vendors
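A minimal sketch of how Workflow 2 could continue, reusing the item_similarity matrix from In [7] (cosine distances between vendor columns); the example vendor is an arbitrary row of df1:
# Pick an example vendor (arbitrary choice, for illustration only)
vendor = df1['vendorname'][10]
vendor_idx = le_vendor.transform([vendor])[0]
# item_similarity holds cosine distances, so the most similar vendors have the
# smallest values; the first entry after sorting is the vendor itself, skip it
closest = np.argsort(item_similarity[vendor_idx, :])[1:11]
pd.DataFrame({
    'vendorname': le_vendor.inverse_transform(closest),
    'cosine_distance': item_similarity[vendor_idx, closest]
})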
Suppose you want to start selling to the government. While FBO.gov publishes government RFPs and you can apply there, government agencies often issue requests when they've already chosen the supplier. Agencies go through FBO.gov because it's a mandatory step for deals north of $25K. But winning at this stage is unlikely if an RFP is already tailored for another supplier.
Reaching warm leads in advance would increase the chances of winning a government contract. The contracts data helps identify those warm leads by looking at purchases in previous years.
There are several ways of searching through those years.
The goods and services bought in each transaction are encoded in the variable productorservicecode. The top ten product categories according to this variable:
In [6]:
%%sql
select
substr(productorservicecode, 1, 4) product_id,
first(substr(productorservicecode, 7)) product_name,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
group by
product_id
order by
sum_dollarsobligated desc
limit 10
Out[6]:
You can find agencies that buy products like yours. If it's "software":
In [7]:
%%sql
select
substr(agencyid, 1, 4) agency_id,
first(substr(agencyid, 7)) agency_name,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
where
productorservicecode contains 'software'
group by
agency_id
order by
sum_dollarsobligated desc
ignore case
Out[7]:
Another way to find customers is the variable principalnaicscode, which encodes the industry in which the vendor does business.
The list of NAICS codes is available at Census.gov, but you can also do a text search in the table. Let's find who bought software from distributors in 2015:
In [6]:
%%sql
select
substr(agencyid, 1, 4) agency_id,
first(substr(agencyid, 7)) agency_name,
substr(principalnaicscode, 1, 6) naics_id,
first(substr(principalnaicscode, 9)) naics_name,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
where
principalnaicscode contains 'software' and
fiscal_year = 2015
group by
agency_id, naics_id
order by
sum_dollarsobligated desc
ignore case
Out[6]:
You can learn details by looking at transactions for a specific (agency, NAICS) pair. For example, what software does TSA buy?
In [7]:
%%sql
select
fiscal_year,
dollarsobligated,
vendorname, city, state, annualrevenue, numberofemployees,
descriptionofcontractrequirement
from
gpqueries:contracts.raw
where
agencyid contains 'transportation security administration' and
principalnaicscode contains 'computer and software stores'
ignore case
Out[7]:
Alternatively, specify vendors your product relates to and check how the government uses it. Top deals in data analytics:
In [8]:
%%sql
select
agencyid,
dollarsobligated,
vendorname,
descriptionofcontractrequirement
from
gpqueries:contracts.raw
where
vendorname contains 'tableau' or
vendorname contains 'socrata' or
vendorname contains 'palantir' or
vendorname contains 'revolution analytics' or
vendorname contains 'mathworks' or
vendorname contains 'statacorp'
order by
dollarsobligated desc
limit
100
ignore case
Out[8]:
Full-text search and regular expressions on the variable descriptionofcontractrequirement narrow the results to relevant product groups:
In [9]:
%%sql
select
agencyid,
dollarsobligated,
descriptionofcontractrequirement
from
gpqueries:contracts.raw
where
descriptionofcontractrequirement contains 'body camera'
limit
100
ignore case
Out[9]:
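The same filter can be written as a regular expression, which also catches variants such as "body-worn camera"; a sketch using legacy SQL's REGEXP_MATCH:
%%sql
select
  agencyid,
  dollarsobligated,
  descriptionofcontractrequirement
from
  gpqueries:contracts.raw
where
  regexp_match(descriptionofcontractrequirement, '(?i)body[- ]?(worn[- ]?)?camera')
limit
  100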
Some rows of descriptionofcontractrequirement contain codes like "IGF::CT::IGF". These codes classify the purchase into three groups of "Inherently Governmental Functions" (IGF).
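A sketch that tallies whatever IGF:: markers appear in the descriptions, using legacy SQL's REGEXP_EXTRACT:
%%sql
select
  regexp_extract(descriptionofcontractrequirement, '(IGF::[A-Z]+::IGF)') igf_code,
  count(*) transactions
from
  gpqueries:contracts.raw
group by
  igf_code
order by
  transactions desc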
You can find local opportunities using variables for vendors (city, state) and services sold (PlaceofPerformanceCity, pop_state_code). The states where most contracts are delivered:
In [10]:
%%sql
select
substr(pop_state_code, 1, 2) state_code,
first(substr(pop_state_code, 4)) state_name,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
group by
state_code
order by
sum_dollarsobligated desc
Out[10]:
Let's check some popular statements about government contracting.
Contractors had to report their revenue and the number of employees. That makes it easy to check whether small businesses are welcome in government contracting:
In [11]:
%%sql --module gpq
define query vendor_size_by_agency
select
substr(agencyid, 1, 4) agency_id,
first(substr(agencyid, 7)) agency_name,
nth(11, quantiles(annualrevenue, 21)) vendor_median_annualrevenue,
nth(11, quantiles(numberofemployees, 21)) vendor_median_numberofemployees,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
group by
agency_id
having
transactions > 1000 and
sum_dollarsobligated > 10e6
order by
vendor_median_annualrevenue asc
In [12]:
bq.Query(gpq.vendor_size_by_agency).to_dataframe()
Out[12]:
The median shows the most likely supplier. Agencies at the top of the table actively employ vendors whose annual revenue is less than $1mn.
The Department of Defense, the largest buyer with $4.5tn worth of goods and services bought over these 17 years, has a median vendor with $2.5mn in revenue and 20 employees. This means that half of the DoD's vendors have less than $2.5mn in revenue.
Set-aside purchases are reserved for special categories of suppliers, like women-, minority-, and veteran-owned businesses. There's a lot of confusion about their share in transactions. We can settle this confusion with data:
In [13]:
%%sql
select
womenownedflag,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
group by
womenownedflag
Out[13]:
Women-owned businesses account for about one tenth of the transactions, but their share of sales is only 3.7%.
A cross-tabulation for major set-aside categories:
In [14]:
%%sql
select
womenownedflag, veteranownedflag, minorityownedbusinessflag,
count(*) transactions,
sum(dollarsobligated) sum_dollarsobligated
from
gpqueries:contracts.raw
group by
womenownedflag, veteranownedflag, minorityownedbusinessflag
order by
womenownedflag, veteranownedflag, minorityownedbusinessflag desc
Out[14]:
For example, firms owned by women, veterans, and minorities (all three at once) sell $5bn in goods and services. That's 0.07% of all government purchases.
Becoming a government contractor may seem difficult at first, but let's see how many new contractors the government had in 2015.
In [15]:
%%sql
select
sum(if(before2015.dunsnumber is null, 1, 0)) new_vendors,
sum(if(before2015.dunsnumber is null, 0, 1)) old_vendors
from
flatten((select unique(dunsnumber) dunsnumber from gpqueries:contracts.raw where fiscal_year = 2015), dunsnumber) in2015
left join
flatten((select unique(dunsnumber) dunsnumber from gpqueries:contracts.raw where fiscal_year < 2015), dunsnumber) before2015
on before2015.dunsnumber = in2015.dunsnumber
Out[15]:
(The variable dunsnumber refers to the unique DUNS number each contractor registers at SAM.gov. It identifies contractors better than vendorname, whose spelling varies.)
In 2015, 12% of the suppliers turned out to be companies that had never sold to the government.
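The 12% figure is simply new_vendors / (new_vendors + old_vendors); a sketch of the arithmetic, assuming the two counts from Out[15] have been loaded into a dataframe df with columns new_vendors and old_vendors:
new = float(df['new_vendors'][0])
old = float(df['old_vendors'][0])
print('Share of first-time vendors in 2015: %.1f%%' % (100 * new / (new + old)))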
This was a short intro to government contracting data.
For more, check the Government Procurement Queries project.