Example Statistics - MDF Datasets

Example: We want to know how many datasets are in MDF and which datasets have the most records.

Note: This example is not kept up to date with the latest statistics. To get the current MDF statistics, run this code yourself.



In [1]:

    
from tqdm import tqdm
import pandas as pd
from mdf_forge.forge import Forge



In [2]:

    
mdf = Forge()



In [3]:

    
# First, let's search for all the datasets. There are currently fewer than 10,000, so `search()` will work fine.
res = mdf.search("mdf.resource_type:dataset", advanced=True)
# Now, let's pull out the source_name, title, and number of records for each dataset.
mdf_resources = []
for r in tqdm(res):
    q = "mdf.resource_type:record AND mdf.source_name:" + r["mdf"]["source_name"]
    # limit=0 because only the match count in `info` is needed, not the records themselves.
    _, info = mdf.search(q, advanced=True, info=True, limit=0)
    mdf_resources.append((r["mdf"]["source_name"], r["dc"]["titles"][0]["title"], info["total_query_matches"]))
df = pd.DataFrame(mdf_resources, columns=["source_name", "title", "num_records"])









    



100%|██████████| 373/373 [03:21<00:00,  1.85it/s]
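
The loop above mixes network calls with bookkeeping. As a rough sketch only, the bookkeeping can be separated out so it runs and can be checked offline. `build_record_query` and `summarize_datasets` are hypothetical helper names, not part of mdf_forge, and `count_records` stands for any callable that maps a query string to a match count. The stand-in entries mimic the result shape used in the loop, and their counts are copied from this notebook's output rather than fetched live.

```python
def build_record_query(source_name):
    """Return the advanced query string matching every record of one dataset."""
    return "mdf.resource_type:record AND mdf.source_name:" + source_name


def summarize_datasets(datasets, count_records):
    """Collect (source_name, title, num_records) for each dataset entry.

    `datasets` is an iterable of dicts shaped like the dataset search results.
    `count_records` is a hypothetical callable: query string -> match count.
    """
    rows = []
    for entry in datasets:
        source_name = entry["mdf"]["source_name"]
        title = entry["dc"]["titles"][0]["title"]
        num_records = count_records(build_record_query(source_name))
        rows.append((source_name, title, num_records))
    return rows


# Stand-in data so the sketch runs without a network connection.
fake_datasets = [
    {"mdf": {"source_name": "oqmd"},
     "dc": {"titles": [{"title": "The Open Quantum Materials Database"}]}},
    {"mdf": {"source_name": "khazana_polymer"},
     "dc": {"titles": [{"title": "Khazana (Polymer)"}]}},
]
fake_counts = {
    build_record_query("oqmd"): 395348,
    build_record_query("khazana_polymer"): 1073,
}

rows = summarize_datasets(fake_datasets, fake_counts.__getitem__)
print(rows)
```

With a live `Forge` instance, the counter would presumably be a small wrapper around the same call the notebook makes, for example `lambda q: mdf.search(q, advanced=True, info=True, limit=0)[1]["total_query_matches"]`.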



In [4]:

    
# Finally, we can print the data we gathered.
print("Number of data resources: {n_datasets}".format(n_datasets=len(df)))
df.sort_values(by="num_records", ascending=False).head(15)









    



Number of data resources: 373






    Out[4]:

     source_name                title                                              num_records
372  sstein_stein_bandgap_2019  Machine learning of optical properties of mate...       478111
78   oqmd                       The Open Quantum Materials Database                     395348
338  stein_bandgap_2019         Machine learning of optical properties of mate...       180900
75   h2o_13                     Machine-learning approach for one- and two-bod...        45482
74   ab_initio_solute_database  High-throughput Ab-initio Dilute Solute Diffus...        31488
249  nist_xps_db                NIST X-ray Photoelectron Spectroscopy Database           29189
4    jarvis                     JARVIS - Joint Automated Repository for Variou...        26559
6    amcs                       The American Mineralogist Crystal Structure Da...        19842
330  w_14                       Accuracy and transferability of Gaussian appro...         9693
76   bfcc13                     Cluster expansion made easy with Bayesian comp...         3783
246  cip                        Evaluation and comparison of classical interat...         3291
2    sluschi                    Solid and Liquid in Ultra Small Coexistence wi...         1618
331  surface_crystal_energy     Data from: Surface energies of elemental crystals         1216
5    khazana_polymer            Khazana (Polymer)                                         1073
327  mdr_item_1496              Ultrahigh Carbon Steel Micrographs                        1007



In [5]:

    
# Bonus: How many records are in MDF in total?
df["num_records"].sum()









    Out[5]:





1230958
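
To put the total in context, a dataset's share of all records is its count divided by that sum. A minimal sketch, assuming the three largest counts and the total printed in this notebook (static numbers that go stale along with the notebook); `record_share` is a hypothetical helper name:

```python
def record_share(num_records, total_records):
    """Return a dataset's fraction of all records, between 0.0 and 1.0."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return num_records / total_records


total_records = 1230958  # value of df["num_records"].sum() shown in Out[5]
top_counts = {
    "sstein_stein_bandgap_2019": 478111,
    "oqmd": 395348,
    "stein_bandgap_2019": 180900,
}

# Print each dataset's share as a percentage with one decimal place.
for source_name, count in top_counts.items():
    print("{}: {:.1%}".format(source_name, record_share(count, total_records)))

top_three_share = record_share(sum(top_counts.values()), total_records)
print("Top three combined: {:.1%}".format(top_three_share))
```

On the live DataFrame the same ratio for every row is `df["num_records"] / df["num_records"].sum()`.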


