tankylz/sysTEm_dataset

Systematically Verified Thermoelectric (sysTEm) Dataset

Figure 1: Graphical summary of the sysTEm dataset.

Description

This repository contains the Systematically Verified Thermoelectric (sysTEm) Dataset, a compilation of experimental thermoelectric (TE) data covering composition, temperature, and transport properties. By sharing this dataset openly, we hope to advance informatics research on TE materials and help discover higher-performance TE materials.

Most of the data were extracted from experimental works using WebPlotDigitizer. In addition, sysTEm merges in two earlier datasets: an updated version of the Materials Research Laboratory (MRL) dataset (original work, the updated dataset) and the Experimentally Synthesized Thermoelectric Materials (ESTM) dataset (paper).

In addition to manual validation, the sysTEm dataset was validated systematically, chiefly by comparing each extracted $zT$ value against the value calculated from the formula below, with a tolerance of 10% between the calculated and extracted values.

$$ zT = \frac{\sigma S^2 T}{\kappa} = \frac{\sigma S^2 T}{\kappa_e + \kappa_l} = \frac{\text{PF}}{\kappa} T $$
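The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's validation code: the function names `calculated_zt` and `within_tolerance` are hypothetical, the relative difference is assumed to be measured against the extracted value, and the unit conversions assume σ in S/cm, S in µV/K, and κ in W/m·K.

```python
def calculated_zt(sigma_s_per_cm, seebeck_uv_per_k, temperature_k, kappa_w_per_mk):
    """Return zT = sigma * S^2 * T / kappa after converting inputs to SI units."""
    sigma_si = sigma_s_per_cm * 100.0      # S/cm -> S/m
    seebeck_si = seebeck_uv_per_k * 1e-6   # µV/K -> V/K
    return sigma_si * seebeck_si ** 2 * temperature_k / kappa_w_per_mk


def within_tolerance(calculated, extracted, tolerance=0.10):
    """Assumed convention: relative difference measured against the extracted value."""
    return abs(calculated - extracted) <= tolerance * abs(extracted)


# Illustrative numbers only, not taken from the dataset.
zt = calculated_zt(1000.0, 200.0, 300.0, 1.5)
print(round(zt, 3))                 # 0.8
print(within_tolerance(zt, 0.85))   # True: the difference of 0.05 is within 10% of 0.85
print(within_tolerance(zt, 1.00))   # False: the difference of 0.20 exceeds 10% of 1.00
```

A row that fails such a check is a candidate for re-extraction or manual review rather than proof of an error, since either the reported $zT$ or one of the transport properties may be the value that was misread.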

Beyond materials screening and discovery, the sysTEm dataset can serve as a benchmark for models trained on other datasets. Those wishing to extend the dataset may also find the validation code, shared in this repository, useful. The full methodology is described in this paper: https://doi.org/10.26434/chemrxiv-2025-4gxmc

How to Cite

If this dataset and the accompanying code have been useful for your work, please consider citing the paper, which is currently a preprint:

@misc{Tang_SystematicallyVerifiedExperimental_2025,
  title        = {Systematically Verified Experimental Thermoelectric Dataset for Data-driven Approaches},
  author       = {Tang, Leng Ze and Purdy, Layla and Mohanty, Trupti and Ng, Leonard W. T. and Sparks, Taylor D.},
  year         = {2025},
  month        = aug,
  archiveprefix= {ChemRxiv},
  doi          = {10.26434/chemrxiv-2025-4gxmc},
  url          = {https://chemrxiv.org/engage/chemrxiv/article-details/68aeafd3a94eede154d987bd},
  urldate      = {2025-08-31},
  keywords     = {experimental data, figure of merit, materials informatics, thermoelectric dataset, thermoelectric materials}
}

sysTEm Dataset Columns

The final dataset (sysTEm_dataset.xlsx) contains the following columns:

| Column Name | Data Type | Description |
| --- | --- | --- |
| # | int | Unique row identifier in the dataset |
| Initial Dataset | string | Source of the data point (e.g., This Work, extended MRL or ESTM) |
| Source Paper | string | DOI link or URL to the original publication |
| Pymatgen Composition | string | Chemical composition as a string, directly convertible to a pymatgen.Composition object |
| reduced_compositions | string | Simplified chemical formula showing the reduced stoichiometric ratio (e.g., Sb₂Si₂Te₆ → SiSbTe₃) |
| Pretty Formula | string | Nominal formula extracted and lightly formatted from the source to support parsing via regex and pymatgen |
| Type of Formula | string | Indicates whether the Pretty Formula is Stoichiometric (e.g., Ag₂Se) or a Mixed Formula (e.g., Te + 0.1 wt% InP₃) |
| Year | int | Year in which the source paper was published |
| Temperature (K) | float | Measurement temperature, in Kelvin |
| Electrical Conductivity (S/cm) | float | Electrical conductivity (σ), in S/cm |
| Seebeck Coefficient (µV/K) | float | Seebeck coefficient (S), in µV/K |
| Power Factor (µW/cmK²) | float | Power factor (PF), in µW/cm·K² |
| zT | float | Dimensionless thermoelectric figure of merit |
| Total Thermal Conductivity (W/mK) | float | Total thermal conductivity (κ = κₑ + κₗ), in W/m·K |
| Lattice Thermal Conductivity (W/mK) | float | Lattice component of thermal conductivity (κₗ), in W/m·K |
| Electronic Thermal Conductivity (W/mK) | float | Electronic component of thermal conductivity (κₑ), in W/m·K |
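
The units listed above imply simple relationships between columns (PF = σS² and κ = κₑ + κₗ), which can serve as a quick sanity check on a single row. The sketch below is an illustration under stated assumptions, not the repository's code: `power_factor_uw_per_cm_k2` and `check_row` are hypothetical helpers, the 10% relative tolerance is borrowed from the zT check described earlier, and the example row contains made-up values.

```python
import math


def power_factor_uw_per_cm_k2(sigma_s_per_cm, seebeck_uv_per_k):
    """PF = sigma * S^2. (S/cm) * (µV/K)^2 = 1e-12 W/(cm·K²) = 1e-6 µW/(cm·K²)."""
    return sigma_s_per_cm * seebeck_uv_per_k ** 2 * 1e-6


def check_row(row, rel_tol=0.10):
    """Return pass/fail flags for one row keyed by the dataset's column names."""
    pf = power_factor_uw_per_cm_k2(
        row["Electrical Conductivity (S/cm)"],
        row["Seebeck Coefficient (µV/K)"],
    )
    kappa_sum = (
        row["Lattice Thermal Conductivity (W/mK)"]
        + row["Electronic Thermal Conductivity (W/mK)"]
    )
    # math.isclose measures rel_tol against the larger of the two magnitudes.
    return {
        "power_factor_ok": math.isclose(pf, row["Power Factor (µW/cmK²)"], rel_tol=rel_tol),
        "kappa_ok": math.isclose(kappa_sum, row["Total Thermal Conductivity (W/mK)"], rel_tol=rel_tol),
    }


# Made-up example row, not taken from the dataset.
example_row = {
    "Electrical Conductivity (S/cm)": 1000.0,
    "Seebeck Coefficient (µV/K)": 200.0,
    "Power Factor (µW/cmK²)": 40.0,
    "Total Thermal Conductivity (W/mK)": 1.5,
    "Lattice Thermal Conductivity (W/mK)": 1.0,
    "Electronic Thermal Conductivity (W/mK)": 0.5,
}
flags = check_row(example_row)
print(flags)
```

In practice some columns may be empty for a given row, so a real check would need to skip missing values before comparing.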

How to use the dataset

You may either download the file, sysTEm_dataset.xlsx, which contains the final dataset, or clone the entire repository.

Stored as an .xlsx file, sysTEm is a single data table, which makes it easy to work with using existing libraries. The example below loads the data for further use.

Loading the Dataset in Python

# accurate as of python 3.10.15, pandas 1.5.3, pymatgen 2024.5.1 on M3 Pro MacBook
import pandas as pd
from pymatgen.core import Composition

# load the dataset (reading .xlsx files with pandas requires the openpyxl package)
df = pd.read_excel('sysTEm_dataset.xlsx')

# convert each composition string into a pymatgen.core.Composition object
composition_col = 'Pymatgen Composition'  # or 'reduced_compositions' for the reduced formula
df[composition_col] = df[composition_col].map(Composition)

# continue by featurizing the composition and temperature...

Using the code in this repository

We recommend cloning the entire repository. The local_pkgs folder contains most of the functions used to validate the data and generate the plots.

Downloading the required dependencies

The work originally used conda to install the required dependencies; an installation method using venv is also provided. If you run into installation issues, refer to full_dependencies.txt, which lists the packages installed in the environment we ran (on an M3 Pro MacBook).

If you are using conda, refer to environment.yml for the installation steps; if you are using venv, refer to the requirements.txt for the installation instructions.

Raw Data and Intermediate Work

Important data files can be found in the dataset_checkpoints folder.

The initial extended MRL and ESTM datasets are in the folder with the prefix 'original_', along with intermediate forms of the merged dataset, which are saved as checkpoints numbered 01 to 04.

The manually_removed_indices.xlsx file lists the entries that were removed, identified by their unique integer in the # column, along with the reasons for their removal.
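
Applying such a removal list to a checkpoint amounts to filtering on the # column. The sketch below uses small made-up tables in place of the real files, and it assumes the removal file stores the identifiers in a column named # alongside a hypothetical Reason column; check the actual column names in manually_removed_indices.xlsx before adapting it.

```python
import pandas as pd

# Made-up stand-ins for a dataset checkpoint and the removal list.
merged = pd.DataFrame({"#": [1, 2, 3, 4], "zT": [0.8, 1.1, 0.3, 0.5]})
removed = pd.DataFrame({
    "#": [2, 4],
    "Reason": ["illustrative reason A", "illustrative reason B"],  # hypothetical column
})

# Keep only the rows whose identifier is not listed as removed.
kept = merged[~merged["#"].isin(removed["#"])].reset_index(drop=True)
print(kept["#"].tolist())  # [1, 3]
```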

Additionally, walkthrough.ipynb details the process used to generate the final dataset and the figures for the paper. Copies of the figures are provided in the figures folder.
