Figure 1: Graphical summary of the dataset.
This repository contains the Systematically Verified Thermoelectric (sysTEm) Dataset, a compilation of experimental thermoelectric (TE) data covering composition, temperature, and transport properties. By sharing this dataset openly, we hope to advance informatics research on TE materials and support the discovery of higher-performance TE materials.
Most of the data were extracted from experimental works using WebPlotDigitizer. In addition, sysTEm merges two earlier datasets: an updated version of the Materials Research Laboratory (MRL) dataset (original work, the updated dataset) and the Experimentally Synthesized Thermoelectric Materials (ESTM) dataset (paper).
Aside from manual validation methods, the sysTEm dataset was validated systematically, the chief of which utilizing the
Aside from materials screening and discovery, the sysTEm dataset can serve as a benchmark for models trained on other datasets. Those wishing to extend the dataset may also find the validation code, shared in this repository, useful. The full methodology is described in this paper: https://doi.org/10.26434/chemrxiv-2025-4gxmc
If this dataset and the accompanying code have been useful for your work, please consider citing the paper, which is currently a preprint:
```bibtex
@misc{Tang_SystematicallyVerifiedExperimental_2025,
  title = {Systematically Verified Experimental Thermoelectric Dataset for Data-driven Approaches},
  author = {Tang, Leng Ze and Purdy, Layla and Mohanty, Trupti and Ng, Leonard W. T. and Sparks, Taylor D.},
  year = {2025},
  month = aug,
  archiveprefix = {ChemRxiv},
  doi = {10.26434/chemrxiv-2025-4gxmc},
  url = {https://chemrxiv.org/engage/chemrxiv/article-details/68aeafd3a94eede154d987bd},
  urldate = {2025-08-31},
  keywords = {experimental data, figure of merit, materials informatics, thermoelectric dataset, thermoelectric materials}
}
```
The final dataset (sysTEm_dataset.xlsx) contains the following columns:
| Column Name | Data Type | Description |
|---|---|---|
| `#` | int | Unique row identifier in the dataset |
| `Initial Dataset` | string | Source of the data point (e.g., This Work, extended MRL or ESTM) |
| `Source Paper` | string | DOI link or URL to the original publication |
| `Pymatgen Composition` | string | Chemical composition as a string, directly convertible to a pymatgen.Composition object |
| `reduced_compositions` | string | Simplified chemical formula showing the reduced stoichiometric ratio (e.g., Sb₂Si₂Te₆ → SiSbTe₃) |
| `Pretty Formula` | string | Nominal formula extracted and lightly formatted from the source to support parsing via regex and Pymatgen |
| `Type of Formula` | string | Indicates whether the Pretty Formula is Stoichiometric (e.g., Ag₂Se) or a Mixed Formula (e.g., Te + 0.1 wt% InP₃) |
| `Year` | int | Year in which the source paper was published |
| `Temperature (K)` | float | Measurement temperature, in Kelvin |
| `Electrical Conductivity (S/cm)` | float | Electrical conductivity (σ), in S/cm |
| `Seebeck Coefficient (µV/K)` | float | Seebeck coefficient (S), in µV/K |
| `Power Factor (µW/cmK²)` | float | Power factor (PF), in µW/cm·K² |
| `zT` | float | Dimensionless thermoelectric figure of merit |
| `Total Thermal Conductivity (W/mK)` | float | Total thermal conductivity (κ = κₑ + κₗ), in W/m·K |
| `Lattice Thermal Conductivity (W/mK)` | float | Lattice component of thermal conductivity (κₗ), in W/m·K |
| `Electronic Thermal Conductivity (W/mK)` | float | Electronic component of thermal conductivity (κₑ), in W/m·K |
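The transport columns are linked by the standard definitions PF = S²σ and zT = PF·T/κ, with κ the total thermal conductivity. The sketch below shows one way to recompute PF and zT from the other columns, mainly to make the unit conversions explicit. The function names are hypothetical, the numbers are illustrative and not taken from sysTEm, and this is not necessarily how the repository's validation code works.

```python
import math


def power_factor(seebeck_uV_per_K, sigma_S_per_cm):
    """Return PF = S^2 * sigma in µW/(cm·K²)."""
    # (µV/K)² · S/cm = 1e-12 W/(cm·K²) = 1e-6 µW/(cm·K²)
    return seebeck_uV_per_K ** 2 * sigma_S_per_cm * 1e-6


def figure_of_merit(pf_uW_per_cmK2, temperature_K, kappa_W_per_mK):
    """Return the dimensionless zT = PF * T / kappa."""
    # 1 µW/(cm·K²) = 1e-4 W/(m·K²), which matches kappa given in W/(m·K)
    return pf_uW_per_cmK2 * 1e-4 * temperature_K / kappa_W_per_mK


# Illustrative values only (assumed, not a row from the dataset)
pf = power_factor(200.0, 1000.0)      # about 40 µW/(cm·K²)
zt = figure_of_merit(pf, 300.0, 1.2)  # about 1.0
print(f"PF = {pf:.2f} µW/(cm·K²), zT = {zt:.2f}")
```

Because the functions use only arithmetic operators, they should also work element-wise when passed pandas Series taken from the corresponding columns, so recomputed values can be compared against the reported `Power Factor (µW/cmK²)` and `zT` columns.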
You may either download the file sysTEm_dataset.xlsx, which contains the final dataset, or clone the entire repository.
Stored as an .xlsx file, sysTEm is a single data table, which makes it easy to work with existing libraries. Here, we present one example of loading the data for further use.
```python
# accurate as of python 3.10.15, pandas 1.5.3, pymatgen 2024.5.1 on M3 Pro MacBook
import pandas as pd
from pymatgen.core import Composition

df = pd.read_excel('sysTEm_dataset.xlsx')  # load the dataset (pandas reads .xlsx via openpyxl)

# convert each composition string into a pymatgen.core.Composition object
composition_col = 'Pymatgen Composition'  # or change this to 'reduced_compositions' for the reduced formula
df[composition_col] = df[composition_col].map(Composition)

# continue by featurizing the composition and temperature...
```

We recommend cloning the entire repository. The local_pkgs folder contains most of the functions used for validating the data and generating plots.
This work originally used conda to install the required dependencies; we also provide an installation route using venv. If you run into installation issues, full_dependencies.txt lists the packages installed in the environment we ran (on an M3 Pro MacBook).
If you are using conda, refer to environment.yml for the installation steps; if you are using venv, refer to requirements.txt.
Important data files can be found in the dataset_checkpoints folder.
The initial extended MRL and ESTM datasets are in the folder with the prefix 'original_', along with intermediate forms of the merged dataset. These intermediate forms are saved at checkpoints starting from 01 to 04.
The manually_removed_indices.xlsx file indicates which entries, identified by a unique integer in #, were removed along with the reasons for their removal.
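As a rough illustration of how such a removal log can be applied, the sketch below drops logged rows from a table by matching on `#`. It uses small in-memory stand-ins instead of the real files. The `Reason` column name and the assumption that the log stores identifiers in a column also named `#` are hypothetical, so check the actual file before reusing this.

```python
import pandas as pd

# Toy stand-ins; in practice both tables would be loaded with pd.read_excel(...)
data = pd.DataFrame({"#": [1, 2, 3, 4], "zT": [0.8, 1.1, 5.0, 0.3]})
removed = pd.DataFrame({"#": [3], "Reason": ["hypothetical: implausible zT"]})

# Keep only rows whose identifier does not appear in the removal log
kept = data[~data["#"].isin(removed["#"])].reset_index(drop=True)
print(kept)
```

Matching on the identifier, not on the row position, keeps the filter valid even if the table has been sorted or re-indexed between checkpoints.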
Additionally, walkthrough.ipynb details the process used to generate the final dataset and the figures for the paper. Copies of the figures are provided in the figures folder.