
make horizontal versions of roc-models-2x4.pdf and model-run-times-bar.pdf

add MACE and matlantis refs
janosh committed Jun 20, 2023
1 parent 4b9da09 commit 551050e
Showing 10 changed files with 334 additions and 53 deletions.
3 changes: 2 additions & 1 deletion matbench_discovery/structure.py
@@ -10,7 +10,8 @@


def perturb_structure(struct: Structure, gamma: float = 1.5) -> Structure:
"""Perturb the atomic coordinates of a pymatgen structure.
"""Perturb the atomic coordinates of a pymatgen structure. Used for CGCNN+P
training set augmentation.
Args:
struct (Structure): pymatgen structure to be perturbed
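For context, a minimal sketch of what such a helper could look like and how it might be used for CGCNN+P training-set augmentation. The Weibull-distributed step length and every name below are assumptions for illustration, not the repo's actual implementation:

```python
import numpy as np
from pymatgen.core import Lattice, Structure


def perturb_structure_sketch(struct: Structure, gamma: float = 1.5) -> Structure:
    """Displace every site by a random vector whose length is drawn from a
    Weibull(gamma) distribution (assumed interpretation of gamma)."""
    perturbed = struct.copy()
    for idx in range(len(perturbed)):
        length = np.random.weibull(gamma)
        direction = np.random.normal(size=3)
        direction *= length / np.linalg.norm(direction)  # random direction, Weibull length
        perturbed.translate_sites(idx, direction, frac_coords=False, to_unit_cell=True)
    return perturbed


# usage: generate a handful of augmented copies of a rock-salt NaCl cell
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
augmented = [perturb_structure_sketch(nacl) for _ in range(5)]
```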
2 changes: 1 addition & 1 deletion readme.md
@@ -17,7 +17,7 @@
Matbench Discovery is an [interactive leaderboard](https://janosh.github.io/matbench-discovery/models) and associated [PyPI package](https://pypi.org/project/matbench-discovery) which together make it easy to rank ML energy models on a task designed to closely simulate a high-throughput discovery campaign for new stable inorganic crystals.

So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.
So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.

<slot name="metrics-table" />

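To make the DAF figure quoted above concrete, here is the arithmetic it summarizes (test-set totals from the benchmark; the precision value is purely illustrative):

```python
# DAF = hit rate among a model's predicted-stable candidates / dummy hit rate
n_test, n_stable = 257_000, 43_000
dummy_rate = n_stable / n_test  # ≈ 0.167, so the maximum possible DAF is ≈ 6

precision = 0.51  # illustrative value, not a reported metric
daf = precision / dummy_rate  # ≈ 3.05, i.e. ~3x the dummy discovery rate
```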
32 changes: 16 additions & 16 deletions scripts/analyze_element_errors.py
@@ -31,6 +31,22 @@
)


# %% map average model error onto elements
frac_comp_col = "fractional composition"
df_wbm[frac_comp_col] = [
Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
]

df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
df_wbm.index
)
assert all(
df_frac_comp.sum(axis=1).round(6) == 1
), "composition fractions don't sum to 1"

# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100) # remove Xe with only 1 entry


# %%
df_mp = pd.read_csv(DATA_FILES.mp_energies, na_filter=False).set_index("material_id")
# compute number of samples per element in training set
@@ -50,22 +66,6 @@
fig.show()
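One possible way to fill in the collapsed per-element training-set counts (the `formula_pretty` column and the `train_count_col` name are assumptions, not necessarily what the script uses):

```python
from collections import Counter

from pymatgen.core import Composition

train_count_col = "MP Occurrences"  # assumed name
elem_counts = Counter(
    elem.symbol for formula in df_mp.formula_pretty for elem in Composition(formula)
)
df_elem_err = pd.DataFrame({train_count_col: pd.Series(elem_counts)})
```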


# %% map average model error onto elements
frac_comp_col = "fractional composition"
df_wbm[frac_comp_col] = [
Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
]

df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
df_wbm.index
)
assert all(
df_frac_comp.sum(axis=1).round(6) == 1
), "composition fractions don't sum to 1"

# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100) # remove Xe with only 1 entry


# %%
for label, srs in (
("MP", df_elem_err[train_count_col]),
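A hedged sketch of how the fractional-composition matrix built above can project per-structure model errors onto elements (the two error-column names are placeholders, not columns from the script):

```python
# composition-weighted mean absolute error per element
abs_err = (df_wbm["model_pred"] - df_wbm["dft_target"]).abs()  # placeholder columns
per_elem_mae = df_frac_comp.mul(abs_err, axis="index").sum() / df_frac_comp.sum()
# per_elem_mae is a Series keyed by element symbol, ready for a periodic-table heatmap
```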
33 changes: 25 additions & 8 deletions scripts/calc_wandb_model_runtimes.py
@@ -11,6 +11,7 @@

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import requests
import wandb
import wandb.apis.public
@@ -122,9 +123,6 @@
print(f"{df_stats[time_col].sum()=:.0f} hours")

# df_stats.round(2).to_json(f"{MODELS}/model-stats.json", orient="index")


# %% plot model run times as pie chart
df_time = (
df_stats.sort_index()
.filter(like=time_col)
@@ -134,6 +132,9 @@
# .drop(index="BOWSR + MEGNet")
.reset_index(names=(model_col := "Model"))
)


# %% plot model run times as pie chart
fig = px.pie(
df_time,
values=time_col,
@@ -179,18 +180,34 @@

# %% plot model run times as bar chart
fig = df_melt.dropna().plot.bar(
y=time_col,
x=model_col,
x=time_col,
y=model_col,
backend="plotly",
# color=time_col,
text_auto=".0f",
text=time_col,
color=model_col,
)
title = f"Total: {df_stats[time_col].sum():.0f} h"
# reduce bar width
fig.update_traces(width=0.7)

title = f"All models: {df_stats[time_col].sum():.0f} h"
fig.layout.legend.update(x=0.98, y=0.98, xanchor="right", yanchor="top", title=title)
fig.layout.xaxis.title = ""
fig.layout.margin.update(l=0, r=0, t=0, b=0)
save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")
save_fig(fig, f"{PDF_FIGS}/model-run-times-bar.pdf")
# save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")

pdf_fig = go.Figure(fig)
# replace legend with annotation in PDF
pdf_fig.layout.showlegend = False
pdf_fig.add_annotation(
text=title,
font=dict(size=15),
x=0.99,
y=0.99,
showarrow=False,
xref="paper",
yref="paper",
)
save_fig(pdf_fig, f"{PDF_FIGS}/model-run-times-bar.pdf", height=300, width=800)
fig.show()
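The same pattern in isolation, on toy data: horizontal bars via the x/y swap above, with the legend replaced by a paper-coordinate annotation for the static export (model names and hours are made up):

```python
import plotly.express as px
import plotly.graph_objects as go

toy = {"Model": ["A", "B", "C"], "Run Time (h)": [120, 45, 300]}
bar_fig = px.bar(toy, x="Run Time (h)", y="Model", color="Model", text="Run Time (h)")

static_fig = go.Figure(bar_fig)  # copy so the interactive figure keeps its legend
static_fig.layout.showlegend = False
static_fig.add_annotation(
    text="All models: 465 h", x=0.99, y=0.99, xref="paper", yref="paper", showarrow=False
)
```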
33 changes: 19 additions & 14 deletions scripts/prc_roc_curves_models.py
@@ -5,6 +5,8 @@


# %%
import math

import pandas as pd
from pymatviz.utils import save_fig
from sklearn.metrics import auc, precision_recall_curve, roc_curve
@@ -23,6 +25,9 @@
facet_col = "Model"
color_col = "Stability Threshold"

n_cols = 4
n_rows = math.ceil(len(models) / n_cols)


# %%
df_roc = pd.DataFrame()
@@ -50,7 +55,7 @@
x="FPR",
y="TPR",
facet_col=facet_col,
facet_col_wrap=2,
facet_col_wrap=4,
backend="plotly",
height=150 * len(df_roc[facet_col].unique()),
color=color_col,
@@ -59,33 +64,32 @@
range_color=(-0.5, 0.5),
hover_name=facet_col,
hover_data={facet_col: False},
facet_col_spacing=0.03,
facet_row_spacing=0.1,
)
)

for anno in fig.layout.annotations:
anno.text = anno.text.split("=", 1)[1] # remove Model= from subplot titles

fig.layout.coloraxis.colorbar.update(
x=1,
y=1,
xanchor="right",
yanchor="top",
thickness=14,
lenmode="pixels",
len=210,
title_side="right",
)
fig.layout.coloraxis.colorbar.update(thickness=14, title_side="right")
if n_cols == 2:
fig.layout.coloraxis.colorbar.update(
x=1, y=1, xanchor="right", yanchor="top", lenmode="pixels", len=210
)

fig.add_shape(type="line", x0=0, y0=0, x1=1, y1=1, line=line, row="all", col="all")
fig.add_annotation(text="No skill", x=0.5, y=0.5, showarrow=False, yshift=-10)
# allow scrolling and zooming each subplot individually
fig.update_xaxes(matches=None)
fig.layout.margin.update(l=0, r=0, b=0, t=20, pad=0)
fig.update_yaxes(matches=None)
fig.show()


# %%
save_fig(fig, f"{FIGS}/roc-models.svelte")
save_fig(fig, f"{PDF_FIGS}/roc-models.pdf")
# save_fig(fig, f"{FIGS}/roc-models-{n_rows}x{n_cols}.svelte")
save_fig(fig, f"{PDF_FIGS}/roc-models-{n_rows}x{n_cols}.pdf", width=1000, height=400)


# %%
@@ -142,6 +146,7 @@


# %%
save_fig(fig, f"{FIGS}/prc-models.svelte")
save_fig(fig, f"{FIGS}/prc-models-{n_rows}x{n_cols}.svelte")
save_fig(fig, f"{PDF_FIGS}/prc-models-{n_rows}x{n_cols}.pdf")
fig.update_yaxes(matches=None)
fig.show()
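For reference, a hedged sketch of how a single model's curve in this plot can be computed with scikit-learn (both column names are placeholders):

```python
from sklearn.metrics import auc, roc_curve

y_true = (df_wbm["e_above_hull_dft"] <= 0).astype(int)  # placeholder column: DFT-stable class
y_score = -df_wbm["e_above_hull_pred"]  # placeholder column; negate so higher = more stable
fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"ROC AUC = {auc(fpr, tpr):.3f}")
```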
5 changes: 4 additions & 1 deletion scripts/scatter_e_above_hull_models.py
@@ -5,6 +5,8 @@


# %%
import math

import numpy as np
import plotly.express as px
from pymatviz.utils import add_identity_line, bin_df_cols, save_fig
@@ -118,6 +120,8 @@

# %% plot all models in separate subplots
domain = (-4, 7)
n_cols = 4
n_rows = math.ceil(len(models) / n_cols)

fig = px.scatter(
df_bin,
@@ -224,7 +228,6 @@


# %%
n_rows, n_cols, *_ = np.array(fig._validate_get_grid_ref(), object).shape
fig_name = f"each-scatter-models-{n_rows}x{n_cols}"
save_fig(fig, f"{FIGS}/{fig_name}.svelte")
save_fig(fig, f"{PDF_FIGS}/{fig_name}.pdf")
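For reference, a rough stand-in for the kind of 2D binning `bin_df_cols` performs before the scatter call (not the pymatviz implementation): snap both axes to a grid and keep per-cell counts so the dense parity plots stay lightweight.

```python
import pandas as pd


def bin_df_cols_sketch(df: pd.DataFrame, x_col: str, y_col: str, n_bins: int = 200) -> pd.DataFrame:
    out = df[[x_col, y_col]].copy()
    for col in (x_col, y_col):
        lo, hi = out[col].min(), out[col].max()
        step = (hi - lo) / n_bins or 1  # guard against a zero-width range
        out[col] = lo + ((out[col] - lo) / step).round() * step  # snap values to grid
    return out.groupby([x_col, y_col], as_index=False).size()  # per-cell point counts
```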
8 changes: 4 additions & 4 deletions site/src/routes/preprint/+page.md
@@ -67,7 +67,7 @@ However, using the DFT-relaxed structure as input to CGCNN renders the discovery
As the name suggests, this work seeks to expand upon the original Matbench suite of property prediction tasks @dunn_benchmarking_2020. By providing a standardized collection of datasets along with canonical cross-validation splits for model evaluation, Matbench helped focus the field of ML for materials, increase comparability across papers and provide a quantitative measure of progress in the field. It aimed to catalyze the field of ML for materials through competition and establishing common goal posts in a similar fashion as ImageNet did for computer vision.

Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
Importantly, all tasks were exclusively concerned with the properties of known materials.
We believe a task that simulates a materials discovery campaign by requiring materials stability prediction from unrelaxed structures to be a missing piece here.

@@ -107,7 +107,7 @@ To simulate a real discovery campaign, our test set inputs are unrelaxed structu

## Models

Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
Our initial benchmark release includes 8 models.

1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -163,7 +163,7 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all
>
> </details>
@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
CHGNet takes the top spot on all metrics except true positive rate (TPR) and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7\% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures than uncharted materials space at large. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs.

Note that MEGNet outperforms M3GNet on DAF (2.70 vs 2.66) even though M3GNet is superior to MEGNet in all other metrics. The reason is the one outlined in the previous paragraph as becomes clear from @fig:cumulative-clf-metrics. MEGNet's line ends at 55.6 k materials which is closest to the true number of 43 k stable materials in our test set. All other models overpredict the sum total of stable materials by anywhere from 40% (~59 k for CGCNN) to 104% (85 k for Wrenformer), resulting in large numbers of false positive predictions which lower their DAFs.
@@ -180,7 +180,7 @@ The reason CGCNN+P achieves better regression metrics than CGCNN but is still wo
<CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
{/if}

> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for.
> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for. We only show the 6 best performing models for visual clarity.
@Fig:cumulative-clf-metrics simulates ranking materials from most to least stable according to model-predicted energies. For each model, we go down that list material by material, calculating at each step the precision and recall of correctly identified stable materials. This simulates exactly how these models might be used in a prospective materials discovery campaign and reveals how a model's performance changes as a function of the discovery campaign length, i.e. the amount of resources available to validate model predictions.

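A minimal sketch of the simulation described above, assuming a dataframe with one DFT and one predicted hull-distance column (names are placeholders): rank by the prediction, then track precision and recall of the stable class while walking down the ranked list.

```python
import numpy as np
import pandas as pd


def cumulative_precision_recall(df: pd.DataFrame, pred_col: str, true_col: str):
    ranked = df.sort_values(pred_col)  # most stable predictions first
    is_stable = (ranked[true_col] <= 0).to_numpy()  # DFT ground truth
    true_pos = np.cumsum(is_stable)
    n_screened = np.arange(1, len(ranked) + 1)
    return true_pos / n_screened, true_pos / is_stable.sum()  # precision, recall
```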
10 changes: 5 additions & 5 deletions site/src/routes/preprint/iclr-ml4mat/+page.md
@@ -10,7 +10,7 @@
<summary>

We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.
We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.

</summary>

@@ -51,7 +51,7 @@ As the name suggests, this work seeks to expand upon the original Matbench suite
and attempt to accelerate the field similar to what ImageNet did for computer vision.

Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
Importantly, all tasks were exclusively concerned with the properties of known materials.
We believe a task that simulates a materials discovery campaign by requiring materials stability predictions from unrelaxed structures to be a missing piece here.

@@ -86,7 +86,7 @@ Moreover, to simulate a discovery campaign our test set inputs are unrelaxed str

## Models

Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
Our initial benchmark release includes 8 models.

1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -108,14 +108,14 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all

> @label:fig:metrics-table Regression and classification metrics for all models tested on our benchmark. The heat map ranges from yellow (best) to blue (worst) performance. DAF = discovery acceleration factor (see text), TPR = true positive rate, TNR = true negative rate, MAE = mean absolute error, RMSE = root mean squared error
@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
M3GNet takes the top spot on most metrics and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found among the ones it predicted stable compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures over randomly exploring materials space. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs. The reason MEGNet outperforms M3GNet on DAF becomes clear from @fig:cumulative-clf-metrics by noting that MEGNet's line ends closest to the total number of stable materials. The other models overpredict this number, resulting in large numbers of false positive predictions that drag down their DAFs.

{#if browser}
<RollingMaeVsHullDistModels />
{/if}

> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures.
> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures. We only show the 6 best performing models for visual clarity.
{#if browser}
<CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
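A hedged sketch of the rolling MAE in the figure captioned above (column names and window width are placeholders): slide a fixed-width window along the DFT hull distance and average the absolute model error inside it.

```python
import numpy as np
import pandas as pd


def rolling_mae_vs_hull_dist(df: pd.DataFrame, true_col: str, pred_col: str, window: float = 0.04) -> pd.Series:
    centers = np.linspace(df[true_col].min(), df[true_col].max(), 200)
    abs_err = (df[pred_col] - df[true_col]).abs()
    maes = [abs_err[(df[true_col] - center).abs() < window / 2].mean() for center in centers]
    return pd.Series(maes, index=centers)
```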
