Skip to content

joegilkes/ASESurfaceFinder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ASESurfaceFinder

PyPI - Version Conda Version

A utility for determining surface facets and adsorption points of ASE-based systems consisting of molecules on surfaces.

ASE comes with an excellent selection of utilities for working with atomic surfaces, enabling the construction of many common surface facets, the definition of high-symmetry points across these surfaces, and the adsorption of arbitrary molecules to these surface sites. However, computationally determining which of these sites an adsorbed molecule is bound to without prior knowledge is a non-trivial task, due to the symmetric equivalence of many sites and high similarity of many surface facets.

ASESurfaceFinder implements automated tools for training and validating random forest classification models (implemented in scikit-learn) that can identify surface sites based on the local atomic environment of adsorbed atoms. Given unseen adsorbed systems, it then enables these models to be used for prediction of both surface facet and high-symmetry adsorption site, to be used when cataloguing externally-generated adsorbed systems.

Installation

ASESurfaceFinder has been developed for Python 3.9 and newer. It is available on PyPI and conda-forge, using their respective package managers:

# Pip (PyPI)
pip install asesurfacefinder

# Conda (conda-forge)
conda install -c conda-forge asesurfacefinder

Usage

Given a workflow that produces XYZ geometries of molecules adsorbed on periodic solid surfaces in unknown locations, ASESurfaceFinder can be used to quickly categorise the adsorption sites by the surface's available high-symmetry points. For example, assuming the workflow can produce surface/adsorbate systems where the surface is one of the following (a 3x3x3 unit cell is shown):

Example Surface ASE Construction Visualisation (3x3x3)
Platinum FCC{100} fcc100('Pt', (3,3,3)) 3x3x3 unit cell of FCC{100} platinum
Silver FCC{110} fcc110('Ag', (3,3,3)) 3x3x3 unit cell of FCC{110} silver
Gold FCC{111} fcc111('Au', (3,3,3)) 3x3x3 unit cell of FCC{111} gold

ASE defines the symmetrically equivalent adsorption sites for each of these surface facets (see ASE's documentation on surface construction). These are used as classification targets within ASESurfaceFinder.

To begin, the package's main class is initialised using these surfaces:

from asesurfacefinder import SurfaceFinder, SampleBounds
from ase.build import fcc100, fcc110, fcc111

surfaces = [fcc100('Pt', (1,1,3)), fcc110('Ag', (1,1,3)), fcc111('Au', (1,1,3))]
surface_labels = ['Pt_fcc100', 'Ag_fcc110', 'Au_fcc111']
bounds = [
    {
        'ontop': SampleBounds(0.6, 1.5, 2.4),
        'bridge': SampleBounds(0.45, z_min=0.75, z_mid=1.4, z_max=2.0),
        'hollow': SampleBounds(0.5, z_min=0.4, z_mid=0.8, z_max=1.75)
    },
    {
        'ontop': SampleBounds(0.6, 1.5, 2.4),
        'hollow': SampleBounds(0.65, z_min=0.1, z_mid=0.65, z_max=1.3),
        'longbridge': SampleBounds(0.5, z_min=-0.6, z_mid=1.0, z_max=1.5),
        'shortbridge': SampleBounds(0.5, z_min=0.6, z_mid=1.75, z_max=2.1)
    },
    {
        'ontop': SampleBounds(0.6, 1.5, 2.4),
        'bridge': SampleBounds(0.35, z_min=0.7, z_mid=1.6, z_max=2.2),
        'fcc': SampleBounds(0.35, z_min=0.4, z_mid=1.4, z_max=2.0),
        'hcp': SampleBounds(0.35, z_min=0.4, z_mid=1.4, z_max=2.0)
    }
]

sf = SurfaceFinder(surfaces, labels=surface_labels, sample_bounds=bounds)

The labels argument is optional, and can be used for more precisely naming the surface facets; without it, they will be referred to by integers.

The sample_bounds argument is similarly optional, and allows for specification of the volume above each named surface site that should be sampled. Each dict corresponds to its respective surface, and each site can take its own bounds (see next section). In the event that some sites are not specified, this volume falls back to a default set of bounds, which can be controlled with the sample_defaults argument to the SurfaceFinder class (defaults to SampleBounds(0.1, 1.0, 2.75)).

Important

While surfaces outside of those that can be generated by ASE can be trained on, they must contain an 'adsorbate_info' dictionary internally that specifies their minimal XY unit cell and their high-symmetry adsorption sites.

Model Training

To train a random forest classification model to recognise the adsorption sites on each of these surfaces, the train method of this class can be called:

sf.train(
    samples_per_site=10000,
    n_jobs=4
)

For each surface facet passed into SurfaceFinder, this samples positions above each adsorption site samples_per_site times for possible perturbations and adsorption heights. The volume that can be sampled is defined by a SampleBounds object, which is passed in on construction of the SurfaceFinder class:

# SampleBounds can take 3 positional arguments: r_max, z_min and z_max.
# This defines a hemispheroidal sampling volume with radius 0.5 Ang, 
# a base plane 1.5 Ang above the highest surface atom and an apex 
# 2.4 Ang above the highest surface atom.
hemispheroid_bounds = SampleBounds(0.5, 1.5, 2.4)

# It can take these as keyword arguments too, with an optional 4th 
# argument: z_mid.
# This defines an ovoid sampling volume with maximum radius 0.3 Ang 
# at its middle plane, which sits 1.4 Ang above the highest surface
# atom. Top and bottom apexes of the volume lie at 2.3 Ang and 
# 1.0 Ang above the highest surface atom respectively.
ovoid_bounds = SampleBounds(r_max=0.3, z_min=1.0, z_mid=1.4, z_max=2.3)

Note

See the next section for examples of how these sampling volumes look.

The local atomic environment around each sampled adsorbate position is then encoded by means of a descriptor from DScribe - a SOAP descriptor with a cutoff of 10 Å is used by default, but this can be modified by passing a descriptor keyword argument when constructing the SurfaceFinder. Each descriptor is matched with a label representing the surface facet and adsorption site that it represents in the format {label}_{site}, where {label} is one of the labels passed into SurfaceFinder and {site} is the ASE high-symmetry site name. The n_jobs argument enables parallelism over the specified number of processes during descriptor creation and model training.

A random forest classifier is then trained on this data, such that any future points above a surface can be classified into a surface facet and adsorption site prediction.

Sample Visualisation

Possible realisations of the adsorption point sampling can be visualised with ASESurfaceFinder's SamplePlotter:

from asesurfacefinder import SamplePlotter
surface = fcc111('Au', (2,2,3))
sp = SamplePlotter(surface, samples_per_site=2000, sample_bounds={
    'ontop': SampleBounds(0.6, 1.5, 2.4),
    'bridge': SampleBounds(0.35, z_min=0.7, z_mid=1.6, z_max=2.2),
    'fcc': SampleBounds(0.35, z_min=0.4, z_mid=1.4, z_max=2.0),
    'hcp': SampleBounds(0.35, z_min=0.4, z_mid=1.4, z_max=2.0)
})

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12,6), layout='constrained')
ax1.set_axis_off(); ax2.set_axis_off(); ax3.set_axis_off()
sp.plot(ax=ax1, rotation='0x,0y,0z')
sp.plot(ax=ax2, rotation='-75x,0y,0z')
sp.plot(ax=ax3, rotation='-90x,-155y,0z')

This acts as a wrapper around ase.visualize.plot's plot_atoms function, sampling adsorption points within the given bounds and displaying them on the requested surface. SamplePlotter.plot() passes through keyword arguments to this function, allowing for properties such as the rotation of each plot to be modified:

Adsorption points sampled above high-symmetry sites on a gold FCC{111} surface

Here we can see the difference between the types of sampling volume defined by SampleBounds: the red ontop site sits over a single atom and is defined by a hemispheroidal dome, while the other sites sit in between atoms and permit atoms to sit in cavities within the top of the surface, so they use an ovoid sampling volume that is widest at the most common absorption heights and tapers to a point above and below this.

When defining SampleBounds for unknown surface sites, SamplePlotter can be used to fine-tune the sampling volume. To choose sensible initial values for a site think about where the smallest and largest atoms you will be adsorbing might lie - in a simple hydrocarbon, hydrogen would be able to approach sites more closely than carbon due to its smaller atomic radius. Choosing the correct sampling bounds will vary depending on the chemical system and how the input geometries for prediction are being created and optimised, so this exercise is left up to the reader.

Important

Visualisation is an important step, as it allows you to ensure that sampling volumes are not overlapping! Overlapping volumes are likely to confuse the classifier during training, decreasing prediction accuracy. Watch out for overlap between symmetrically equivalent sites - SamplePlotter does not currently show these.

Model Validation

The training can be validated before performing any predictions on real systems:

sf.validate(samples_per_site=2000, surf_mults=[(1,1,1), (3,3,1), (5,5,1)], sample_bounds=[
    {'bridge': SampleBounds(0.5, z_min=0.65, z_mid=1.4, z_max=2.2)},
    {},
    {'fcc': SampleBounds(0.3, z_min=0.5, z_mid=1.4, z_max=1.9),
    'hcp': SampleBounds(0.3, z_min=0.5, z_mid=1.4, z_max=1.9)}
])

This performs the same adsorbate position sampling procedure as was used to generate local atomic environments in the training of the classifier, but different values of the sampling volume can be passed through the sample_bounds argument to test the limits of applicability of the current model. While a dict has to be provided for each surface (when using this argument), surface sites not given new validation bounds will default to using the bounds that were used in training.

In this case, since we have expanded the sample volume of the bridge site on Pt_fcc100 it is likely that there will be a few failed predictions, as the predictor will have to extrapolate to local environments outside the height and XY bounds that it was trained on. Conversely, we have reduced the volume for the fcc and hcp sites of Au_fcc111, so validation on these sites should be completely accurate. It can therefore be useful to set the training bounds to a slightly larger volume than you might expect, and decrease accordingly to realistic bounds when validating.

Additionally, the surf_mults argument can be used to scale up the unit cell of each surface facet before sampling to ensure that adsorbates on larger surface slabs are correctly predicted. In this example, since the provided facets were all 1x1x3 unit cells, this will produce sampled adsorbate positions on 1x1x3, 3x3x3 and 5x5x3 surface slabs.

ASESurfaceFinder will output a summary of these validation results once it has predicted labels for these newly generated positions and compared them with their true values. If the classifier has been trained correctly, this will simply be something like

66000/66000 sites classified correctly (accuracy = 1.0).

However, if this validation fails for any sites, ASESurfaceFinder prints them in a table where h is the sampled adsorbate height and r is the radial distance in Angstroms away from the ideal adsorption site:

63559/66000 sites classified correctly (accuracy = 0.9630151515151515).

True                                            | Predicted
-------------------------------------------------------------------
Au_fcc111_bridge (5, 5, 1) (h = 2.48, r = 0.44) | Au_fcc111_hcp
Au_fcc111_bridge (5, 5, 1) (h = 1.42, r = 0.67) | Au_fcc111_ontop
Au_fcc111_bridge (5, 5, 1) (h = 1.60, r = 0.42) | Au_fcc111_hcp
Au_fcc111_fcc (5, 5, 1) (h = 2.43, r = 0.52)    | Au_fcc111_bridge
Au_fcc111_fcc (5, 5, 1) (h = 2.40, r = 0.63)    | Au_fcc111_bridge
Au_fcc111_fcc (5, 5, 1) (h = 1.98, r = 0.54)    | Au_fcc111_bridge
...

This indicates that either

  1. the training samples were not varied enough to account for the sample space covered by the validation samples,
  2. the validation samples covered too wide an area and overlapped between neighbouring sites, causing samples to be mislabelled,
  3. the training samples accounted for too wide a sample space, causing overlap between neighbouring sites and misclassification.

In a production use case, each site's SampleBounds should be tweaked such that validation returns no errors while sampling as broad and realistic a volume above each adsorption site as possible.

Model Prediction

To predict the high-symmetry adsorption sites and surface facets of real surface/adsorbate systems, these can simply be fed into a trained SurfaceFinder.

Taking an example system of 3 methanol molecules adsorbed to 'fcc' sites of gold FCC{111}, with a free methanol molecule above the surface:

Four methanol molecules on gold FCC{111}

The adsorbates and gas phase molecule can be separated from the surface and classified by their adsorption sites (or lack thereof) with the SurfaceFinder.predict() method:

from ase.io import read

real_surface = read('examples/4MeO_Au_fcc111.xyz')
slab, molecules, labels = sf.predict(real_surface)

This returns three outputs:

  1. slab: The clean surface geometry as an ase.Atoms, without any adsorbates.
  2. molecules: A list of ase.Atoms representing the adsorbates and gas phase molecules found on and above the surface respectively.
  3. labels: A list of dictionaries corresponding to the entries in molecules containing the atoms of the respective adsorbate that are adsorbed onto the surface, their predicted sites and the adsorption heights above those sites.

For example, given the above system:

for molecule, label in zip(molecules, labels):
    print(molecule, label)
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3, 'height': 1.7941429100000015}}
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3, 'height': 1.79542644}}
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3, 'height': 1.7944157000000018}}
Atoms(symbols='CHOH3', pbc=False, tags=...) {}

The first three entries are the adsorbed methanol molecules, correctly classified as being bound to the 'fcc' high-symmetry site on a gold FCC{111} surface. Each molecule's label dictionary is keyed by the index of the atom in the corresponding molecule which is bound; in the event that a molecule is bound to the surface by multiple atoms, it will contain en entry for each. In this case, all adsorbates are bound by their respective 0th atoms - their oxygens, indicated by the 'bonded_elem' key.

Additionally, each label also provides the coordination of the adsorption site as detected by ASE's connectivity tools. Since the 'fcc' site sits in between three atoms, this is reflected in the 'coordination' key.

The fourth molecule returned is the gas phase methanol, which is not adsorbed onto the surface. This is reflected in its label dictionary, which is empty to indicate no atoms are bound.

Adjusting connectivity cutoffs

SurfaceFinder identifies atoms to run for prediction by searching for connectivity to the surface with ASE's NeighborList tools. When using more approximate semiempirical or forcefield methods to construct adsorbed systems, atoms can be less tightly bound to surfaces and partially or entirely fail these connectivity checks, leading to adsorbed atoms being tagged incorrectly.

ASE therefore allows for passing different cutoff distances to these connectivity checks, such that the 'collision volume' of each atom can be expanded. This is usually done through the natural_cutoffs function, which accepts a scaling factor mult. We allow a set of cutoffs (which can even be manually defined) to be passed through during site prediction.

The following represents a more complex example with different adsorbates bound on a range of sites, optimised with a semiempirical method that can underbind adsorbates:

A selection of molecules on a range of gold FCC{111} adsorption sites

Running the same prediction workflow with slightly expanded NeighborList cutoffs reveals carbon monoxide on an 'ontop' site, hydroxide on a 'fcc' site, and methylamine on a 'bridge' site:

from ase.neighborlist import natural_cutoffs

underbound_surface = read('examples/CO+H2O+NHCH3_Au_fcc111.xyz')
cutoffs = natural_cutoffs(underbound_surface, mult=1.15)
slab, molecules, labels = sf.predict(underbound_surface, nl_cutoffs=cutoffs)

for molecule, label in zip(molecules, labels):
    print(molecule, label)
Atoms(symbols='CO', pbc=False) {0: {'site': 'Au_fcc111_ontop', 'bonded_elem': 'C', 'coordination': 1, 'height': 2.4249690500000014}}
Atoms(symbols='OH', pbc=False) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3, 'height': 1.7710107300000004}}
Atoms(symbols='NHCH3', pbc=False) {0: {'site': 'Au_fcc111_bridge', 'bonded_elem': 'N', 'coordination': 2, 'height': 1.9164665700000008}}

Rejecting near-surface atoms by coordination

In some cases, adsorbates can contain atoms which lie close enough to the surface to be picked up by ASE's connectivity detection, but should not be considered to be adsorbed. This usually occurs to hydrogen atoms that are directly bonded to adsorbed atoms. In such cases, it can be beneficial to classify these atoms as not adsorbed and instead rely on a single point of adsorption. Take this ethylamino anion on Pt FCC{100}, where a hydrogen atom sits above a bridge site:

Ethylamino anion on Pt FCC{100}, with low-lying hydrogen atom potentially interacting with the surface.

Identifying such atoms can be tricky, although when sent for site prediction they usually come back undercoordinated. Depending on how SurfaceFinder is trained, this molecule can return the following labels:

{
    2: {'site': 'Pt_fcc100_ontop', 'bonded_elem': 'N', 'coordination': 1, 'height': 1.9993630699999994}, 
    8: {'site': 'Pt_fcc100_bridge', 'bonded_elem': 'H', 'coordination': 1, 'height': 2.0844453000000005}
}

As FCC{100} bridge sites are supposed to be 2-coordinated, we know this atom is undercoordinated. However, this is not necessarily grounds for rejection - occasionally, multiple atoms can share the same surface site and have the wrong coordination, and rejecting all of these would lead to predictions of unadsorbed molecules. Rejection of an adsorbed atom therefore only occurs if it is directly bonded to an adsorbed atom with the correct coordination, such as this hydrogen bonded to the 1-coordinate ontop site here.

This behaviour has to be opted into if desired using the reject_wrong_coordination argument of SurfaceFinder.predict():

undercoordinated_h_surface = read('examples/CH3CH2NH_Pt_fcc100.xyz')
slab, molecules, labels = sf.predict(undercoordinated_h_surface, reject_wrong_coordination=True)

for molecule, label in zip(molecules, labels):
    print(molecule, label)
Atoms(symbols='C2NH6', pbc=False, tags=...) {2: {'site': 'Pt_fcc100_ontop', 'bonded_elem': 'N', 'coordination': 1, 'height': 1.9993630699999994}}

Sometimes ASESurfaceFinder will not find the correct coordination for some sites, as it calculates this by placing a hydrogen atom at the bottom of the site's sampling volume and checking its connectivity to the surface. In particularly wide hollows such as the hollow site in FCC{110}, the small scale of the hydrogen atom can lead to predicted undercoordination. This can be identified by the printout at SurfaceFinder initialisation, and corrected by passing a value to SurfaceFinder's site_coordinations keyword argument:

sf = SurfaceFinder([fcc110('Ag', (1,1,3))], labels=['Ag_fcc110'], site_coordinations=[{'hollow': 4}])

Rejecting near-surface hydrogens

Depending on how SurfaceFinder is trained, the above hydrogen may instead be predicted to be adsorbed to the same ontop site that the nitrogen is on! Rejecting the hydrogen through undercoordination to its surface site will fail, since it is on a 1-coordinate site:

{
    2: {'bonded_elem': 'N', 'coordination': 1, 'height': 1.9993630699999994, 'site': 'Pt_fcc100_ontop'}, 
    8: {'bonded_elem': 'H', 'coordination': 1, 'height': 2.0844453000000005, 'site': 'Pt_fcc100_ontop'}
}

Instead, we can take the more aggressive option of rejecting all hydrogen atoms that are detected as adsorbed, provided they are directly bonded to a non-hydrogen adsorbed atom. This behaviour can be enabled with the reject_bonded_hydrogens argument of SurfaceFinder.predict() (defaults to False). The distinction of rejected hydrogens being bonded to non-hydrogen atoms is important, as otherwise hydrogen molecules on the surface could be rejected entirely if they are in a horizontal conformation.

A note on adsorbate identification

ASESurfaceFinder does not currently contain a nice, clean way of separating all adsorbates from surfaces. In the first instance, it searches for a set of per-atom 'tags' in the underlying ase.Atoms object (accessed by atoms.get_tags()). These are generated by ASE when using its adsorbate-adding functionality, and represent the layer of the surface that each atom is part of - adsorbates are tagged at layer 0, with increasing layer tags indicating increasing depth.

However, it is unlikely that input surface/adsorbate systems will be tagged like this since if they were generated with ASE, you would already know the sites on which the adsorbates sit. Instead, ASESurfaceFinder currently relies on separation by element. Since it knows which elements the surfaces are made of, it will simply mask these elements off to separate the adsorbates from the surface. A warning will be displayed when this is performed.

Since the majority of surface/adsorbate systems are organic molecules on metal surfaces, this approach will work most of the time. However, when dealing with surfaces that contain the same elements as adsorbates (e.g. diamond, silica, etc.), this approach will fail. A more concrete approach for adsorbate separation is being developed, but will likely require analysis of the training surfaces as a whole.

About

Random forest-based classification of surface facets and high-symmetry adsorption sites.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages