
make horizontal versions of roc-models-2x4.pdf and model-run-times-bar.pdf

add MACE and matlantis refs
janosh committed Jun 20, 2023
1 parent 4b9da09 commit 551050e
Showing 10 changed files with 334 additions and 53 deletions.
3 changes: 2 additions & 1 deletion matbench_discovery/structure.py
@@ -10,7 +10,8 @@


def perturb_structure(struct: Structure, gamma: float = 1.5) -> Structure:
"""Perturb the atomic coordinates of a pymatgen structure.
"""Perturb the atomic coordinates of a pymatgen structure. Used for CGCNN+P
training set augmentation.
Args:
struct (Structure): pymatgen structure to be perturbed
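For context, a minimal sketch of what such a helper could look like and how it might be used for CGCNN+P training-set augmentation. The Weibull-distributed step length and every name below are assumptions for illustration, not the repo's actual implementation:

```python
import numpy as np
from pymatgen.core import Lattice, Structure


def perturb_structure_sketch(struct: Structure, gamma: float = 1.5) -> Structure:
    """Displace every site by a random vector whose length is drawn from a
    Weibull(gamma) distribution (assumed interpretation of gamma)."""
    perturbed = struct.copy()
    for idx in range(len(perturbed)):
        length = np.random.weibull(gamma)
        direction = np.random.normal(size=3)
        direction *= length / np.linalg.norm(direction)  # random direction, Weibull length
        perturbed.translate_sites(idx, direction, frac_coords=False, to_unit_cell=True)
    return perturbed


# usage: generate a handful of augmented copies of a rock-salt NaCl cell
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
augmented = [perturb_structure_sketch(nacl) for _ in range(5)]
```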
2 changes: 1 addition & 1 deletion readme.md
@@ -17,7 +17,7 @@
Matbench Discovery is an [interactive leaderboard](https://janosh.github.io/matbench-discovery/models) and associated [PyPI package](https://pypi.org/project/matbench-discovery) which together make it easy to rank ML energy models on a task designed to closely simulate a high-throughput discovery campaign for new stable inorganic crystals.

So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.
So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.

<slot name="metrics-table" />

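To make the DAF figure quoted above concrete, here is the arithmetic it summarizes (test-set totals from the benchmark; the precision value is purely illustrative):

```python
# DAF = hit rate among a model's predicted-stable candidates / dummy hit rate
n_test, n_stable = 257_000, 43_000
dummy_rate = n_stable / n_test  # ≈ 0.167, so the maximum possible DAF is ≈ 6

precision = 0.51  # illustrative value, not a reported metric
daf = precision / dummy_rate  # ≈ 3.05, i.e. ~3x the dummy discovery rate
```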
32 changes: 16 additions & 16 deletions scripts/analyze_element_errors.py
@@ -31,6 +31,22 @@
)


# %% map average model error onto elements
frac_comp_col = "fractional composition"
df_wbm[frac_comp_col] = [
Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
]

df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
df_wbm.index
)
assert all(
df_frac_comp.sum(axis=1).round(6) == 1
), "composition fractions don't sum to 1"

# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100) # remove Xe with only 1 entry


# %%
df_mp = pd.read_csv(DATA_FILES.mp_energies, na_filter=False).set_index("material_id")
# compute number of samples per element in training set
@@ -50,22 +66,6 @@
fig.show()
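One possible way to fill in the collapsed per-element training-set counts (the `formula_pretty` column and the `train_count_col` name are assumptions, not necessarily what the script uses):

```python
from collections import Counter

from pymatgen.core import Composition

train_count_col = "MP Occurrences"  # assumed name
elem_counts = Counter(
    elem.symbol for formula in df_mp.formula_pretty for elem in Composition(formula)
)
df_elem_err = pd.DataFrame({train_count_col: pd.Series(elem_counts)})
```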


# %% map average model error onto elements
frac_comp_col = "fractional composition"
df_wbm[frac_comp_col] = [
Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
]

df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
df_wbm.index
)
assert all(
df_frac_comp.sum(axis=1).round(6) == 1
), "composition fractions don't sum to 1"

# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100) # remove Xe with only 1 entry


# %%
for label, srs in (
("MP", df_elem_err[train_count_col]),
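A hedged sketch of how the fractional-composition matrix built above can project per-structure model errors onto elements (the two error-column names are placeholders, not columns from the script):

```python
# composition-weighted mean absolute error per element
abs_err = (df_wbm["model_pred"] - df_wbm["dft_target"]).abs()  # placeholder columns
per_elem_mae = df_frac_comp.mul(abs_err, axis="index").sum() / df_frac_comp.sum()
# per_elem_mae is a Series keyed by element symbol, ready for a periodic-table heatmap
```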
33 changes: 25 additions & 8 deletions scripts/calc_wandb_model_runtimes.py
@@ -11,6 +11,7 @@

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import requests
import wandb
import wandb.apis.public
@@ -122,9 +123,6 @@
print(f"{df_stats[time_col].sum()=:.0f} hours")

# df_stats.round(2).to_json(f"{MODELS}/model-stats.json", orient="index")


# %% plot model run times as pie chart
df_time = (
df_stats.sort_index()
.filter(like=time_col)
@@ -134,6 +132,9 @@
# .drop(index="BOWSR + MEGNet")
.reset_index(names=(model_col := "Model"))
)


# %% plot model run times as pie chart
fig = px.pie(
df_time,
values=time_col,
@@ -179,18 +180,34 @@

# %% plot model run times as bar chart
fig = df_melt.dropna().plot.bar(
y=time_col,
x=model_col,
x=time_col,
y=model_col,
backend="plotly",
# color=time_col,
text_auto=".0f",
text=time_col,
color=model_col,
)
title = f"Total: {df_stats[time_col].sum():.0f} h"
# reduce bar width
fig.update_traces(width=0.7)

title = f"All models: {df_stats[time_col].sum():.0f} h"
fig.layout.legend.update(x=0.98, y=0.98, xanchor="right", yanchor="top", title=title)
fig.layout.xaxis.title = ""
fig.layout.margin.update(l=0, r=0, t=0, b=0)
save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")
save_fig(fig, f"{PDF_FIGS}/model-run-times-bar.pdf")
# save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")

pdf_fig = go.Figure(fig)
# replace legend with annotation in PDF
pdf_fig.layout.showlegend = False
pdf_fig.add_annotation(
text=title,
font=dict(size=15),
x=0.99,
y=0.99,
showarrow=False,
xref="paper",
yref="paper",
)
save_fig(pdf_fig, f"{PDF_FIGS}/model-run-times-bar.pdf", height=300, width=800)
fig.show()
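The same pattern in isolation, on toy data: horizontal bars via the x/y swap above, with the legend replaced by a paper-coordinate annotation for the static export (model names and hours are made up):

```python
import plotly.express as px
import plotly.graph_objects as go

toy = {"Model": ["A", "B", "C"], "Run Time (h)": [120, 45, 300]}
bar_fig = px.bar(toy, x="Run Time (h)", y="Model", color="Model", text="Run Time (h)")

static_fig = go.Figure(bar_fig)  # copy so the interactive figure keeps its legend
static_fig.layout.showlegend = False
static_fig.add_annotation(
    text="All models: 465 h", x=0.99, y=0.99, xref="paper", yref="paper", showarrow=False
)
```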
33 changes: 19 additions & 14 deletions scripts/prc_roc_curves_models.py
@@ -5,6 +5,8 @@


# %%
import math

import pandas as pd
from pymatviz.utils import save_fig
from sklearn.metrics import auc, precision_recall_curve, roc_curve
@@ -23,6 +25,9 @@
facet_col = "Model"
color_col = "Stability Threshold"

n_cols = 4
n_rows = math.ceil(len(models) / n_cols)


# %%
df_roc = pd.DataFrame()
@@ -50,7 +55,7 @@
x="FPR",
y="TPR",
facet_col=facet_col,
facet_col_wrap=2,
facet_col_wrap=4,
backend="plotly",
height=150 * len(df_roc[facet_col].unique()),
color=color_col,
@@ -59,33 +64,32 @@
range_color=(-0.5, 0.5),
hover_name=facet_col,
hover_data={facet_col: False},
facet_col_spacing=0.03,
facet_row_spacing=0.1,
)
)

for anno in fig.layout.annotations:
anno.text = anno.text.split("=", 1)[1] # remove Model= from subplot titles

fig.layout.coloraxis.colorbar.update(
x=1,
y=1,
xanchor="right",
yanchor="top",
thickness=14,
lenmode="pixels",
len=210,
title_side="right",
)
fig.layout.coloraxis.colorbar.update(thickness=14, title_side="right")
if n_cols == 2:
fig.layout.coloraxis.colorbar.update(
x=1, y=1, xanchor="right", yanchor="top", lenmode="pixels", len=210
)

fig.add_shape(type="line", x0=0, y0=0, x1=1, y1=1, line=line, row="all", col="all")
fig.add_annotation(text="No skill", x=0.5, y=0.5, showarrow=False, yshift=-10)
# allow scrolling and zooming each subplot individually
fig.update_xaxes(matches=None)
fig.layout.margin.update(l=0, r=0, b=0, t=20, pad=0)
fig.update_yaxes(matches=None)
fig.show()


# %%
save_fig(fig, f"{FIGS}/roc-models.svelte")
save_fig(fig, f"{PDF_FIGS}/roc-models.pdf")
# save_fig(fig, f"{FIGS}/roc-models-{n_rows}x{n_cols}.svelte")
save_fig(fig, f"{PDF_FIGS}/roc-models-{n_rows}x{n_cols}.pdf", width=1000, height=400)


# %%
@@ -142,6 +146,7 @@


# %%
save_fig(fig, f"{FIGS}/prc-models.svelte")
save_fig(fig, f"{FIGS}/prc-models-{n_rows}x{n_cols}.svelte")
save_fig(fig, f"{PDF_FIGS}/prc-models-{n_rows}x{n_cols}.pdf")
fig.update_yaxes(matches=None)
fig.show()
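For reference, a hedged sketch of how a single model's curve in this plot can be computed with scikit-learn (both column names are placeholders):

```python
from sklearn.metrics import auc, roc_curve

y_true = (df_wbm["e_above_hull_dft"] <= 0).astype(int)  # placeholder column: DFT-stable class
y_score = -df_wbm["e_above_hull_pred"]  # placeholder column; negate so higher = more stable
fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"ROC AUC = {auc(fpr, tpr):.3f}")
```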
5 changes: 4 additions & 1 deletion scripts/scatter_e_above_hull_models.py
@@ -5,6 +5,8 @@


# %%
import math

import numpy as np
import plotly.express as px
from pymatviz.utils import add_identity_line, bin_df_cols, save_fig
@@ -118,6 +120,8 @@

# %% plot all models in separate subplots
domain = (-4, 7)
n_cols = 4
n_rows = math.ceil(len(models) / n_cols)

fig = px.scatter(
df_bin,
@@ -224,7 +228,6 @@


# %%
n_rows, n_cols, *_ = np.array(fig._validate_get_grid_ref(), object).shape
fig_name = f"each-scatter-models-{n_rows}x{n_cols}"
save_fig(fig, f"{FIGS}/{fig_name}.svelte")
save_fig(fig, f"{PDF_FIGS}/{fig_name}.pdf")
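For reference, a rough stand-in for the kind of 2D binning `bin_df_cols` performs before the scatter call (not the pymatviz implementation): snap both axes to a grid and keep per-cell counts so the dense parity plots stay lightweight.

```python
import pandas as pd


def bin_df_cols_sketch(df: pd.DataFrame, x_col: str, y_col: str, n_bins: int = 200) -> pd.DataFrame:
    out = df[[x_col, y_col]].copy()
    for col in (x_col, y_col):
        lo, hi = out[col].min(), out[col].max()
        step = (hi - lo) / n_bins or 1  # guard against a zero-width range
        out[col] = lo + ((out[col] - lo) / step).round() * step  # snap values to grid
    return out.groupby([x_col, y_col], as_index=False).size()  # per-cell point counts
```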
8 changes: 4 additions & 4 deletions site/src/routes/preprint/+page.md
@@ -67,7 +67,7 @@ However, using the DFT-relaxed structure as input to CGCNN renders the discovery
As the name suggests, this work seeks to expand upon the original Matbench suite of property prediction tasks @dunn_benchmarking_2020. By providing a standardized collection of datasets along with canonical cross-validation splits for model evaluation, Matbench helped focus the field of ML for materials, increase comparability across papers and provide a quantitative measure of progress in the field. It aimed to catalyze the field of ML for materials through competition and establishing common goal posts in a similar fashion as ImageNet did for computer vision.

Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
Importantly, all tasks were exclusively concerned with the properties of known materials.
We believe a task that simulates a materials discovery campaign by requiring materials stability prediction from unrelaxed structures to be a missing piece here.

@@ -107,7 +107,7 @@ To simulate a real discovery campaign, our test set inputs are unrelaxed structu

## Models

Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
Our initial benchmark release includes 8 models.

1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -163,7 +163,7 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all
>
> </details>
@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
CHGNet takes the top spot on all metrics except true positive rate (TPR) and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7\% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures than uncharted materials space at large. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs.

Note that MEGNet outperforms M3GNet on DAF (2.70 vs 2.66) even though M3GNet is superior to MEGNet in all other metrics. The reason is the one outlined in the previous paragraph as becomes clear from @fig:cumulative-clf-metrics. MEGNet's line ends at 55.6 k materials which is closest to the true number of 43 k stable materials in our test set. All other models overpredict the sum total of stable materials by anywhere from 40% (~59 k for CGCNN) to 104% (85 k for Wrenformer), resulting in large numbers of false positive predictions which lower their DAFs.
@@ -180,7 +180,7 @@ The reason CGCNN+P achieves better regression metrics than CGCNN but is still wo
<CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
{/if}

> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for.
> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for. We only show the 6 best performing models for visual clarity.
@Fig:cumulative-clf-metrics simulates ranking materials from most to least stable according to model-predicted energies. For each model, we go down that list material by material, calculating at each step the precision and recall of correctly identified stable materials. This simulates exactly how these models might be used in a prospective materials discovery campaign and reveals how a model's performance changes as a function of the discovery campaign length, i.e. the amount of resources available to validate model predictions.

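A minimal sketch of the simulation described above, assuming a dataframe with one DFT and one predicted hull-distance column (names are placeholders): rank by the prediction, then track precision and recall of the stable class while walking down the ranked list.

```python
import numpy as np
import pandas as pd


def cumulative_precision_recall(df: pd.DataFrame, pred_col: str, true_col: str):
    ranked = df.sort_values(pred_col)  # most stable predictions first
    is_stable = (ranked[true_col] <= 0).to_numpy()  # DFT ground truth
    true_pos = np.cumsum(is_stable)
    n_screened = np.arange(1, len(ranked) + 1)
    return true_pos / n_screened, true_pos / is_stable.sum()  # precision, recall
```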
10 changes: 5 additions & 5 deletions site/src/routes/preprint/iclr-ml4mat/+page.md
@@ -10,7 +10,7 @@
<summary>

We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.
We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.

</summary>

@@ -51,7 +51,7 @@ As the name suggests, this work seeks to expand upon the original Matbench suite
and attempt to accelerate the field similar to what ImageNet did for computer vision.

Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
Importantly, all tasks were exclusively concerned with the properties of known materials.
We believe a task that simulates a materials discovery campaign by requiring materials stability predictions from unrelaxed structures to be a missing piece here.

@@ -86,7 +86,7 @@ Moreover, to simulate a discovery campaign our test set inputs are unrelaxed str

## Models

Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
Our initial benchmark release includes 8 models.

1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -108,14 +108,14 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all

> @label:fig:metrics-table Regression and classification metrics for all models tested on our benchmark. The heat map ranges from yellow (best) to blue (worst) performance. DAF = discovery acceleration factor (see text), TPR = true positive rate, TNR = true negative rate, MAE = mean absolute error, RMSE = root mean squared error
@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
M3GNet takes the top spot on most metrics and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found among the ones it predicted stable compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures over randomly exploring materials space. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs. The reason MEGNet outperforms M3GNet on DAF becomes clear from @fig:cumulative-clf-metrics by noting that MEGNet's line ends closest to the total number of stable materials. The other models overpredict this number, resulting in large numbers of false positive predictions that drag down their DAFs.

{#if browser}
<RollingMaeVsHullDistModels />
{/if}

> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures.
> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures. We only show the 6 best performing models for visual clarity.
{#if browser}
<CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
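A hedged sketch of the rolling MAE in the figure captioned above (column names and window width are placeholders): slide a fixed-width window along the DFT hull distance and average the absolute model error inside it.

```python
import numpy as np
import pandas as pd


def rolling_mae_vs_hull_dist(df: pd.DataFrame, true_col: str, pred_col: str, window: float = 0.04) -> pd.Series:
    centers = np.linspace(df[true_col].min(), df[true_col].max(), 200)
    abs_err = (df[pred_col] - df[true_col]).abs()
    maes = [abs_err[(df[true_col] - center).abs() < window / 2].mean() for center in centers]
    return pd.Series(maes, index=centers)
```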
