We are going to use SQLAlchemy first to pull a list of tables, and then to create an ORM class for the table we care about, in this case snp147.
To do this we will be making use of SQLAlchemy's automapping functionality, which creates ORM classes directly from tables. Unfortunately, it fails with a number of UCSC tables because they don't have detectable primary keys, which the ORM requires. Because of this we will first inspect the database and dump a list of tables, and then explicitly declare a class for the table we care about. SQLAlchemy will fill in all the extra columns for us, so we only need to declare the class and its primary key; everything else is done for us.
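As a quick illustration of the failure mode (a standalone sketch of my own, not part of the analysis below, and note that reflecting every table in hg19 is slow), plain automapping silently skips any table whose primary key it cannot detect:
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

demo_engine = create_engine("mysql+pymysql://genome@genome-mysql.cse.ucsc.edu/hg19?charset=utf8mb4")
DemoBase = automap_base()
DemoBase.prepare(demo_engine, reflect=True)  # slow: reflects every table it can
# Tables without a detectable primary key, snp147 included, never become classes
print('snp147' in DemoBase.classes)  # expect: False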
In [1]:
import pandas as pd
from sqlalchemy import inspect
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session
from sqlalchemy.sql import select
from sqlalchemy.ext.automap import automap_base
In [2]:
# Connect to the hg19 database
engine = create_engine("mysql+pymysql://genome@genome-mysql.cse.ucsc.edu/{organism}?charset=utf8mb4".format(organism='hg19'))
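If you want to check the connection before going further, here is an optional sanity check I am adding, using the pre-2.0 SQLAlchemy API this notebook already relies on:
# Quick check that the public UCSC server is reachable
print(engine.execute("SELECT VERSION()").scalar())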
In [3]:
# Get the list of tables
inspector = inspect(engine)
tables = inspector.get_table_names()
This list is very long. You could just look through it, but instead we are going to filter it for tables that begin with 'snp'.
In [4]:
len(tables)
Out[4]:
In [5]:
snp_tables = [i for i in tables if i.startswith('snp')]
In [6]:
sorted(snp_tables)
Out[6]:
OK, so let's use snp147.
In [7]:
inspector.get_columns('snp147')
Out[7]:
There are a lot of columns, but the vast majority will automap just fine, so we don't need to do anything for them. However, we do need to add a primary key, as none is detected here.
I actually just want to look up SNP names by position, so I am going to explicitly declare a primary key (since I don't care about the table's own primary key, and some UCSC tables have primary keys that are not detected by the automapper) and also declare the columns I care about. The mapped table will still end up with all of the columns, but this guarantees access to the ones I need.
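You can confirm the missing key with the inspector (a quick check of my own); an empty 'constrained_columns' list is exactly what trips up the automapper:
# No primary-key constraint is declared on snp147
inspector.get_pk_constraint('snp147')
# expected output along the lines of: {'constrained_columns': [], 'name': None}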
In [8]:
# The automap base will detect all tables in the database and create classes for
# as many as it can. Many UCSC tables don't become classes because the primary key
# cannot be detected.
Base = automap_base()
class SNP147(Base):
    __tablename__ = 'snp147'

    name = Column(String(length=15), primary_key=True, nullable=False)

    # The following columns do not need to be declared; the automapper will do
    # it for us. I map them anyway for my own personal reference.
    chrom = Column(String(length=31), nullable=False)
    chromStart = Column(Integer, nullable=False)
    chromEnd = Column(Integer, nullable=False)
# reflect the tables
Base.prepare(engine, reflect=True)
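Once prepare() has run, the class carries every reflected column, not just the four declared above. A quick way to verify this (my own addition):
# All of snp147's columns were reflected onto the class by the automapper
print(len(SNP147.__table__.columns))
print(SNP147.__table__.columns.keys()[:6])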
In [9]:
session = Session(engine)
In [10]:
session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd == 16952481).first()
Out[10]:
In [11]:
positions = [
    154326279,
    11029552,
    241803636,
    59165838,
    39991588,
    204733046,
    16341354,
    16971948,
    154056834,
    9712006
]
In [12]:
q = session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions))
In [13]:
q.all()
Out[13]:
In [14]:
df = pd.read_sql_query(q.statement, engine)
In [15]:
df
Out[15]:
In [16]:
df2 = pd.read_sql_query(
    session.query(SNP147.name, SNP147.chrom, SNP147.chromEnd)
           .filter(SNP147.chrom == 'chr1')
           .filter(SNP147.chromEnd.in_(positions))
           .statement,
    engine
)
In [17]:
df2
Out[17]:
That's it, simple! Yes, you could query the MySQL server directly, but this approach is much more robust for on-the-fly data analysis: it is easy to get raw SQL syntax wrong, while composing queries with SQLAlchemy in Jupyter is trivial.
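For comparison, here is roughly what the equivalent raw SQL would look like (a sketch I am adding, not from the original notebook):
# The same lookup as df2, written by hand; fine for a one-off, but every
# literal has to be interpolated and quoted correctly yourself.
df_raw = pd.read_sql_query(
    "SELECT name, chrom, chromEnd FROM snp147 "
    "WHERE chrom = 'chr1' AND chromEnd IN ({})".format(
        ', '.join(str(p) for p in positions)),
    engine
)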