Example Statistics - MDF Datasets

Example: We want to know how many datasets are in MDF and which datasets have the most records.

Note: This example is not kept up to date with the latest statistics. To get the current MDF statistics, run this code yourself.


In [1]:
from tqdm import tqdm
import pandas as pd
from mdf_forge.forge import Forge

In [2]:
mdf = Forge()

In [3]:
# First, search for all the datasets. There are currently fewer than 10,000, so `search()` will work fine.
res = mdf.search("mdf.resource_type:dataset", advanced=True)
# Now, pull out the source_name, title, and number of records for each dataset.
mdf_resources = []
for r in tqdm(res):
    source_name = r["mdf"]["source_name"]
    q = "mdf.resource_type:record AND mdf.source_name:" + source_name
    # limit=0 requests no records; only the match count in `info` is used.
    _, info = mdf.search(q, advanced=True, info=True, limit=0)
    mdf_resources.append((source_name, r["dc"]["titles"][0]["title"], info["total_query_matches"]))
df = pd.DataFrame(mdf_resources, columns=["source_name", "title", "num_records"])


100%|██████████| 373/373 [03:21<00:00,  1.85it/s]
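
The per-dataset query in the loop above is built by plain string concatenation. As a minimal sketch, that step can be pulled into a small helper so the query format lives in one place. The name `build_record_query` is hypothetical and is not part of `mdf_forge`; the field names are the ones used in the cell above.

```python
def build_record_query(source_name):
    """Return the advanced query string matching every record of one dataset.

    Hypothetical helper (not part of mdf_forge). It only reproduces the
    string concatenation used in the loop above.
    """
    return "mdf.resource_type:record AND mdf.source_name:" + source_name


# Example with a source_name that appears in this notebook's results.
query = build_record_query("oqmd")
print(query)
```

The resulting string could then be passed to `mdf.search(query, advanced=True, info=True, limit=0)` exactly as in the loop.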

In [4]:
# Finally, we can print the data we gathered.
print("Number of data resources: {n_datasets}".format(n_datasets=len(df)))
df.sort_values(by="num_records", ascending=False).head(15)


Number of data resources: 373
Out[4]:
     source_name                title                                              num_records
372  sstein_stein_bandgap_2019  Machine learning of optical properties of mate...       478111
 78  oqmd                       The Open Quantum Materials Database                     395348
338  stein_bandgap_2019         Machine learning of optical properties of mate...       180900
 75  h2o_13                     Machine-learning approach for one- and two-bod...        45482
 74  ab_initio_solute_database  High-throughput Ab-initio Dilute Solute Diffus...        31488
249  nist_xps_db                NIST X-ray Photoelectron Spectroscopy Database           29189
  4  jarvis                     JARVIS - Joint Automated Repository for Variou...        26559
  6  amcs                       The American Mineralogist Crystal Structure Da...        19842
330  w_14                       Accuracy and transferability of Gaussian appro...         9693
 76  bfcc13                     Cluster expansion made easy with Bayesian comp...         3783
246  cip                        Evaluation and comparison of classical interat...         3291
  2  sluschi                    Solid and Liquid in Ultra Small Coexistence wi...         1618
331  surface_crystal_energy     Data from: Surface energies of elemental crystals         1216
  5  khazana_polymer            Khazana (Polymer)                                         1073
327  mdr_item_1496              Ultrahigh Carbon Steel Micrographs                        1007

In [5]:
# Bonus: How many records are in MDF in total?
df["num_records"].sum()


Out[5]:
1230958
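
To put the total in context, the sketch below computes what share of all records the three largest datasets hold. It uses only the figures printed in Out[4] and Out[5] above, so it reflects this particular run and not the current state of MDF. The names `record_share`, `top_datasets`, and `top3_share` are illustrative, not part of the notebook.

```python
# Figures copied from Out[4] and Out[5] above; they reflect the run shown
# in this notebook, not the current state of MDF.
top_datasets = [
    ("sstein_stein_bandgap_2019", 478111),
    ("oqmd", 395348),
    ("stein_bandgap_2019", 180900),
]
total_records = 1230958


def record_share(count, total):
    """Return the fraction of all records held by one dataset."""
    return count / total


shares = {name: record_share(n, total_records) for name, n in top_datasets}
top3_share = sum(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Top 3 combined: {top3_share:.1%}")
```

With the `df` built earlier, the same ratios could be obtained for every dataset with `df["num_records"] / df["num_records"].sum()`.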
