SAS Datasets

In the lab we commonly use SAS as our main statistics package. There are times where you want to take a SAS dataset and do something with it in python. Typically you would have to use SAS to export the dataset as a CSV and then import it in python. However, there is a python module that allow you to directly access a SAS dataset and dump its contents into a pandas dataframe.

Install the sas7bdat library.

pip install sas7bdat --user

Once installed you can do the following.


In [1]:
import sas7bdat

In [5]:
with sas7bdat.SAS7BDAT('fbgn2coord.sas7bdat') as FH:
    df = FH.to_data_frame()


[fbgn2coord.sas7bdat] header length 65536 != 8192
[fbgn2coord.sas7bdat] [fbgn2coord.sas7bdat] header length 65536 != 8192
WARNING:/mnt/SHARE/McIntyre_Lab/useful_dmel_data/flybase557/sas_data/fbgn2coord.sas7bdat:[fbgn2coord.sas7bdat] [fbgn2coord.sas7bdat] header length 65536 != 8192

In [8]:
df.head()


Out[8]:
chrom end primary_fbgn start strand
0 3R 2648518 FBgn0000003 2648220 +
1 2R 18060339 FBgn0000008 18024473 +
2 3R 12655771 FBgn0000014 12632936 -
3 3R 12797958 FBgn0000015 12752932 -
4 3L 16640982 FBgn0000017 16608966 -

This package also comes with a nice command line tool that allows you to get the list of columns from a file.


In [9]:
%%bash
sas7bdat_to_csv --header fbgn2coord.sas7bdat



[fbgn2coord.sas7bdat] header length 65536 != 8192
[fbgn2coord.sas7bdat] Header:
	col_count_p1: 5
	col_count_p2: 0
	column_count: 5
	compression: None
	creator: None
	creator_proc: SORT
	date_created: 2014-12-04 18:04:03.851510
	date_modified: 2014-12-04 18:04:03.851510
	endianess: little
	file_type: DATA
	filename: fbgn2coord.sas7bdat
	header_length: 65536
	lcp: 4
	lcs: 0
	mix_page_row_count: 1123
	name: FBGN2COORD
	os_name: x86_64
	os_type: 3.13.0-40-generi
	page_count: 15
	page_length: 65536
	platform: unix
	row_count: 17294
	row_length: 56
	sas_release: 9.0401M0
	server_type: Linux
	u64: True

Contents of dataset "FBGN2COORD":
Num Name         Type   Length Format Label
--- ------------ ------ ------ ------ -----
  1 primary_fbgn string     11 $           
  2 chrom        string     25 $           
  3 start        number      8 BEST        
  4 end          number      8 BEST        
  5 strand       string      1 $           


In [ ]: