HW Assignment 1 for DSM-CER

  • Due date: submit to your GitHub account by Thu, Jan 12, 2017 at 5 pm PST
  • Overview: This HW assignment is primarily designed to introduce you to some aspects of Data Science and clean energy research. A secondary goal of the HW assignment is to get practice with git, GitHub classroom and related tools you will be using in Professor Beck's class.
  • Instructions:
    • Check out this python notebook (DSM-CER HW1 skel) from GitHub classroom (if you are reading this, you probably did it). If you had a friend give you the python notebook, stop now and figure out how to get the correct starter code from GitHub classroom.
      • Rename the notebook as: "DSM-CER HW1 lastname-firstname"
    • Do the assigned reading for DSM-CER (The Data Science review article by Beck et al.). If you have any lingering questions, enter them in the appropriate cell below, in markdown format and colored in _red text_
    • Find a research paper related to your own research area, or a research area of interest that is at the intersection of materials/chemicals science/engineering and Data Science. In the appropriate cell below, add the correct link to the paper so that we can find it.
      • If you cannot find a paper you want to read, you can look for papers related to the Materials Genome Initiative, The Materials Project, the Harvard Clean Energy Project, The Open Quantum Materials Database, or references from Part 1.
        • If you still can't find a paper, talk to us on Slack or come to office hours
          • If you still can't find a paper, I am out of suggestions
    • Write a brief (200 words or fewer) summary of the paper, making sure to note the methods that were used as well as how the authors addressed one or more data-related challenges
    • Turn in the completed python notebook using GitHub classroom

Part 1

Any questions from the reading?

From the article, we can see the importance of data science in chemical engineering. My question is: how much effort should we put into data science and into traditional experiments, respectively? It seems that data science combined with simulation can do a great deal. Is it possible that, in the future, 90% of research will be computational, with the remainder consisting of experiments whose outcomes have already been predicted by machine, or of tests run to generate data?

Another question, not closely related to the article: will we have gained enough data science skills by the end of this quarter?

Part 3

Provide your brief summary here in markdown format. Please make every effort that it is free from typos and grammatical mistakes. Excessive typos or basic grammatical mistakes (i.e., that interfere with readability) will be marked down (no pun) 15%. There are Jupyter plug-ins that can do spell-check, but if you are concerned it might be faster to just copy-paste your summary and then format it correctly.

Metal-organic frameworks (MOFs) are nanoporous solids formed from metal ions or clusters and polydentate organic linkers. They are widely used in gas separation and storage, catalysis, nonlinear optics, sensing, controlled drug release, and light harvesting. Grand canonical Monte Carlo (GCMC) simulations can calculate the structural and functional properties of a MOF in good agreement with experimental results. However, libraries of hypothetical nanoporous materials contain hundreds of thousands of hypothetical MOF structures, which makes running GCMC simulations on every candidate computationally impractical.

In this article, the authors used machine learning and cheminformatics models to preselect high-performing structures and discard low-performing ones. They developed accurate quantitative structure-property relationship (QSPR) models from purely geometrical features of the material, such as pore size, surface area, and void fraction, combined with an atomic property-weighted radial distribution function (AP-RDF) descriptor, to predict CO2 uptake in MOFs. A database of 324,500 hypothetical MOF structures was generated. A randomly selected 10% of the database formed the calibration set used to train the QSPR models; the remaining MOFs formed the test set used to validate them. GCMC simulations were used to calculate the CO2 uptake of all MOFs. A MOF is classified as high-performing if its uptake is greater than 4 mmol/g at 1 bar CO2 and as low-performing if its uptake is below 4 mmol/g. A cutoff parameter can then be applied at run time to decide which structures are worth more compute-intensive screening. Using this classifier reduces the large number of GCMC simulations required.
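
The screening workflow summarized above can be sketched in Python. This is a minimal illustration and not the authors' implementation: the nearest-centroid classifier is a simple stand-in for the QSPR models, and the function names, the single toy feature, and the toy uptake relation in the demo are all assumptions made up for the example. Only the 4 mmol/g threshold and the 10% calibration fraction come from the summary.

```python
import random

UPTAKE_THRESHOLD = 4.0       # mmol/g at 1 bar CO2 (from the summary)
CALIBRATION_FRACTION = 0.10  # 10% of the database (from the summary)


def label_high_performing(uptake_mmol_per_g):
    """True if the uptake exceeds the threshold.

    Assumption: an uptake of exactly 4 mmol/g is treated as low-performing.
    """
    return uptake_mmol_per_g > UPTAKE_THRESHOLD


def split_calibration_test(n_structures, fraction=CALIBRATION_FRACTION, seed=0):
    """Randomly pick calibration indices; the remaining indices form the test set."""
    indices = list(range(n_structures))
    random.Random(seed).shuffle(indices)
    n_cal = int(round(fraction * n_structures))
    return indices[:n_cal], indices[n_cal:]


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_centroids(features, labels):
    """Stand-in classifier (hypothetical): one centroid per class."""
    high = [f for f, y in zip(features, labels) if y]
    low = [f for f, y in zip(features, labels) if not y]
    if not high or not low:
        raise ValueError("calibration set must contain both classes")
    return centroid(high), centroid(low)


def score(feature, centroids):
    """Positive when the structure is closer to the high-performing centroid."""
    high_c, low_c = centroids
    d_high = sum((a - b) ** 2 for a, b in zip(feature, high_c))
    d_low = sum((a - b) ** 2 for a, b in zip(feature, low_c))
    return d_low - d_high


def preselect(features, centroids, cutoff=0.0):
    """Positions of structures whose score exceeds the run-time cutoff."""
    return [i for i, f in enumerate(features) if score(f, centroids) > cutoff]


if __name__ == "__main__":
    # Toy data, invented for illustration only: one geometric feature
    # (a void fraction) and a made-up linear uptake relation.
    features = [[i / 199] for i in range(200)]
    uptakes = [8.0 * f[0] for f in features]

    cal_idx, test_idx = split_calibration_test(len(features))
    cal_labels = [label_high_performing(uptakes[i]) for i in cal_idx]
    model = train_centroids([features[i] for i in cal_idx], cal_labels)

    # Only the preselected structures would go on to expensive simulation.
    kept = preselect([features[i] for i in test_idx], model, cutoff=0.0)
    print(f"{len(kept)} of {len(test_idx)} test structures kept for screening")
```

Raising `cutoff` keeps fewer structures and so saves more simulation time, at the cost of discarding more candidates that might have turned out to be high-performing.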