Commit 65026fa

Add scripts for new data sources
1 parent 2185b38 commit 65026fa

6 files changed: +493 -75 lines changed

README.md

+41 -6
@@ -49,7 +49,13 @@ You can use the provided training set for the `training_data` and `holdout_data`
 
 ## How do I create data for these scripts?
 
-You can use the scripts in this repository to convert the [CODE-15% dataset](https://zenodo.org/records/4916206) to [WFDB](https://wfdb.io/) format. These instructions use `code15_hdf5` as the path for the input data files and `code15_wfdb` for the output data files, but you can replace them with the absolute or relative paths for the files on your machine.
+You can use the scripts in this repository to convert the [CODE-15% dataset](https://zenodo.org/records/4916206), the [SaMi-Trop dataset](https://zenodo.org/records/4905618), and the [PTB-XL dataset](https://physionet.org/content/ptb-xl/) to [WFDB](https://wfdb.io/) format.
+
+Please see the [data](https://physionetchallenges.org/2025/#data) section of the website for more information about the Challenge data.
+
+#### CODE-15% dataset
+
+These instructions use `code15_input` as the path for the input data files and `code15_output` for the output data files, but you can replace them with the absolute or relative paths for the files on your machine.
 
 1. Download and unzip one or more of the `exam_part` files and the `exams.csv` file in the [CODE-15% dataset](https://zenodo.org/records/4916206).
 
@@ -58,13 +64,42 @@ You can use the scripts in this repository to convert the [CODE-15% dataset](htt
 3. Convert the CODE-15% dataset to WFDB format, with the available demographics information and Chagas labels in the WFDB header file, by running
 
         python prepare_code15_data.py \
-            -i code15_hdf5/exams_part0.hdf5 code15_hdf5/exams_part1.hdf5 \
-            -d code15_hdf5/exams.csv \
-            -l code15_hdf5/code15_chagas_labels.csv \
-            -o code15_wfdb
+            -i code15_input/exams_part0.hdf5 code15_input/exams_part1.hdf5 \
+            -d code15_input/exams.csv \
+            -l code15_input/code15_chagas_labels.csv \
+            -o code15_output
 
 Each `exam_part` file in the [CODE-15% dataset](https://zenodo.org/records/4916206) contains approximately 20,000 ECG recordings. You can include more or fewer of these files to increase or decrease the number of ECG recordings, respectively. You may want to start with fewer ECG recordings to debug your code.
 
+#### SaMi-Trop dataset
+
+These instructions use `samitrop_input` as the path for the input data files and `samitrop_output` for the output data files, but you can replace them with the absolute or relative paths for the files on your machine.
+
+1. Download and unzip the `exams.zip` file and the `exams.csv` file in the [SaMi-Trop dataset](https://zenodo.org/records/4905618).
+
+2. Download and unzip the Chagas labels, i.e., the [`samitrop_chagas_labels.csv`](https://physionetchallenges.org/2025/data/samitrop_chagas_labels.zip) file.
+
+3. Convert the SaMi-Trop dataset to WFDB format, with the available demographics information and Chagas labels in the WFDB header file, by running
+
+        python prepare_samitrop_data.py \
+            -i samitrop_input/exams.hdf5 \
+            -d samitrop_input/exams.csv \
+            -l samitrop_input/samitrop_chagas_labels.csv \
+            -o samitrop_output
+
+#### PTB-XL dataset
+
+These instructions use `ptbxl_input` as the path for the input data files and `ptbxl_output` for the output data files, but you can replace them with the absolute or relative paths for the files on your machine. We are using the `records500` folder, which has a 500Hz sampling frequency, but you can also try the `records100` folder, which has a 100Hz sampling frequency.
+
+1. Download and, if necessary, unzip the [PTB-XL dataset](https://physionet.org/content/ptb-xl/).
+
+2. Update the WFDB files with the available demographics information and Chagas labels by running
+
+        python prepare_ptbxl_data.py \
+            -i ptbxl_input/records500/ \
+            -d ptbxl_input/ptbxl_database.csv \
+            -o ptbxl_output
+
 ## Which scripts can I edit?
 
 Please edit the following script to add your code:
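
After converting any of the three datasets, it is worth spot-checking one output record before moving on. The following is a minimal sketch, not part of this commit, that reads a converted record back with the WFDB Python package; the directory `code15_output` and the record name `1234567` are placeholders for whatever the conversion script actually wrote.

    import wfdb

    # Placeholder path: any record (.hea/.dat pair) written by the conversion scripts above.
    record = wfdb.rdrecord('code15_output/1234567')

    print(record.fs)              # sampling frequency, e.g. 400 for CODE-15% or 500 for PTB-XL records500
    print(record.p_signal.shape)  # (samples, leads), in physical units (mV)
    print(record.sig_name)        # lead names
    print(record.comments)        # header comments, e.g. ['Age: ...', 'Sex: ...', 'Chagas label: ...']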
@@ -122,7 +157,7 @@ If you have trouble running your code, then please try the follow steps to run t
     user@computer:~/example/python-example-2025$ docker run -it -v ~/example/model:/challenge/model -v ~/example/holdout_data:/challenge/holdout_data -v ~/example/holdout_outputs:/challenge/holdout_outputs -v ~/example/training_data:/challenge/training_data image bash
 
     root@[...]:/challenge# ls
-    Dockerfile             holdout_outputs        run_mode.py
+    Dockerfile             holdout_outputs        run_model.py
     evaluate_model.py      LICENSE                training_data
     helper_code.py         README.md
     holdout_data           requirements.txt

helper_code.py

-1

@@ -481,4 +481,3 @@ def sanitize_boolean_value(x):
         return 1
     else:
         return float('nan')
-

prepare_code15_data.py

+48 -65
@@ -6,19 +6,20 @@
 import numpy as np
 import os
 import os.path
+import pandas as pd
 import sys
 import wfdb
 
 from helper_code import is_integer, is_boolean, sanitize_boolean_value
 
 # Parse arguments.
 def get_parser():
-    description = 'Prepare the CODE-15 database.'
+    description = 'Prepare the CODE-15% dataset for the Challenge.'
     parser = argparse.ArgumentParser(description=description)
-    parser.add_argument('-i', '--signal_files', type=str, required=True, nargs='*')
-    parser.add_argument('-f', '--signal_format', type=str, required=False, default='dat', choices=['dat', 'mat'])
-    parser.add_argument('-d', '--demographics_file', type=str, required=True)
-    parser.add_argument('-l', '--label_file', type=str, required=True)
+    parser.add_argument('-i', '--signal_files', type=str, required=True, nargs='*') # exams_part0.hdf5, exams_part1.hdf5, ...
+    parser.add_argument('-d', '--demographics_file', type=str, required=True) # exams.csv
+    parser.add_argument('-l', '--labels_file', type=str, required=True) # code15_chagas_labels.csv
+    parser.add_argument('-f', '--signal_format', type=str, required=False, default='dat', choices=['dat', 'mat'])
     parser.add_argument('-o', '--output_path', type=str, required=True)
     return parser
 
@@ -29,7 +30,7 @@ def suppress_stdout():
     with open(os.devnull, 'w') as devnull:
         stdout = sys.stdout
         sys.stdout = devnull
-        try:
+        try:
             yield
         finally:
             sys.stdout = stdout
@@ -39,8 +40,8 @@ def convert_dat_to_mat(record, write_dir=None):
     import wfdb.io.convert
 
     # Change the current working directory; wfdb.io.convert.matlab.wfdb_to_matlab places files in the current working directory.
-    cwd = os.getcwd()
     if write_dir:
+        cwd = os.getcwd()
         os.chdir(write_dir)
 
     # Convert the .dat file to a .mat file.
@@ -75,7 +76,7 @@ def convert_dat_to_mat(record, write_dir=None):
 # Fix the checksums from the Python WFDB library.
 def fix_checksums(record, checksums=None):
     if checksums is None:
-        x = wfdb.rdrecord(record, physical=False)
+        x = wfdb.rdrecord(record, physical=False)
         signals = np.asarray(x.d_signal)
         checksums = np.sum(signals, axis=0, dtype=np.int16)
 
@@ -98,56 +99,39 @@ def run(args):
 # Run script.
 def run(args):
     # Load the patient demographic data.
-    exam_id_to_patient_id = dict()
     exam_id_to_age = dict()
     exam_id_to_sex = dict()
 
-    with open(args.demographics_file, 'r') as f:
-        for i, l in enumerate(f):
-            arrs = [arr.strip() for arr in l.split(',')]
-            if i == 0:
-                idx_exam_id = arrs.index('exam_id')
-                idx_patient_id = arrs.index('patient_id')
-                idx_age = arrs.index('age')
-                idx_is_male = arrs.index('is_male')
-            else:
-                exam_id = arrs[idx_exam_id]
-                assert(is_integer(exam_id))
-                exam_id = int(exam_id)
-
-                patient_id = arrs[idx_patient_id]
-                assert(is_integer(patient_id))
-                patient_id = int(patient_id)
-                exam_id_to_patient_id[exam_id] = patient_id
-
-                age = arrs[idx_age]
-                assert(is_integer(age))
-                age = int(age)
-                exam_id_to_age[exam_id] = age
-
-                is_male = arrs[idx_is_male]
-                assert(is_boolean(is_male))
-                is_male = sanitize_boolean_value(is_male)
-                sex = 'Male' if is_male else 'Female' # This variable was encoding as a binary value.
-                exam_id_to_sex[exam_id] = sex
-
-    # Load the Chagas labels.
-    exam_id_to_chagas = dict()
-
-    with open(args.label_file, 'r') as f:
-        for i, l in enumerate(f):
-            arrs = [arr.strip() for arr in l.split(',')]
-            if i == 0:
-                idx_exam_id = arrs.index('exam_id')
-                idx_chagas = arrs.index('chagas')
-            else:
-                exam_id = arrs[idx_exam_id]
-                assert(is_integer(exam_id))
-                exam_id = int(exam_id)
-
-                chagas = arrs[idx_chagas]
-                chagas = sanitize_boolean_value(chagas)
-                exam_id_to_chagas[exam_id] = bool(chagas)
+    df = pd.read_csv(args.demographics_file)
+    for idx, row in df.iterrows():
+        exam_id = row['exam_id']
+        assert(is_integer(exam_id))
+        exam_id = int(exam_id)
+
+        age = row['age']
+        assert(is_integer(age))
+        age = int(age)
+        exam_id_to_age[exam_id] = age
+
+        is_male = row['is_male']
+        assert(is_boolean(is_male))
+        is_male = sanitize_boolean_value(is_male)
+        sex = 'Male' if is_male else 'Female' # This variable was encoded as a binary value.
+        exam_id_to_sex[exam_id] = sex
+
+    # Load the Chagas labels.
+    exam_id_to_chagas = dict()
+
+    df = pd.read_csv(args.labels_file)
+    for idx, row in df.iterrows():
+        exam_id = row['exam_id']
+        assert(is_integer(exam_id))
+        exam_id = int(exam_id)
+
+        chagas = row['chagas']
+        assert(is_boolean(chagas))
+        chagas = sanitize_boolean_value(chagas)
+        exam_id_to_chagas[exam_id] = bool(chagas)
 
     # Load and convert the signal data.
 
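
The hunk above replaces the hand-rolled CSV parsing with pandas. `DataFrame.iterrows` keeps the code close to the original loop; if it ever becomes a bottleneck on a large `exams.csv`, the same lookups can be built column-wise. A minimal sketch, not part of this commit, reusing the repository's `sanitize_boolean_value` helper and the example input path from the README above:

    import pandas as pd

    from helper_code import sanitize_boolean_value

    df = pd.read_csv('code15_input/exams.csv')  # example path from the README above

    exam_ids = df['exam_id'].astype(int)
    exam_id_to_age = dict(zip(exam_ids, df['age'].astype(int)))

    is_male = df['is_male'].map(sanitize_boolean_value)  # 1, 0, or NaN, as in the loop above
    exam_id_to_sex = dict(zip(exam_ids, is_male.map({1: 'Male', 0: 'Female'})))
    # The Chagas labels can be mapped the same way from code15_chagas_labels.csv.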
@@ -156,7 +140,7 @@ def run(args):
     sampling_frequency = 400
     units = 'mV'
 
-    # Define the paramters for the WFDB files.
+    # Define the parameters for the WFDB files.
     gain = 1000
     baseline = 0
     num_bits = 16
@@ -179,7 +163,7 @@ def run(args):
                    continue
                else:
                    pass
-
+
                physical_signals = np.array(f['tracings'][i], dtype=np.float32)
 
                # Perform basic error checking on the signal.
@@ -207,25 +191,24 @@ def run(args):
                digital_signals[~np.isfinite(digital_signals)] = -2**(num_bits-1)
                digital_signals = np.asarray(digital_signals, dtype=np.int32) # We need to promote from 16-bit integers due to an error in the Python WFDB library.
 
-               # Add the exam ID, the patient ID, age, sex, and the Chagas label.
-               patient_id = exam_id_to_patient_id[exam_id]
+               # Add the exam ID, age, sex, and the Chagas label.
                age = exam_id_to_age[exam_id]
                sex = exam_id_to_sex[exam_id]
                chagas = exam_id_to_chagas[exam_id]
-               comments = [f'Exam ID: {exam_id}', f'Patient ID: {patient_id}', f'Age: {age}', f'Sex: {sex}', f'Chagas label: {chagas}']
-
+               comments = [f'Age: {age}', f'Sex: {sex}', f'Chagas label: {chagas}']
+
                # Save the signal.
                record = str(exam_id)
-               wfdb.wrsamp(record, fs=sampling_frequency, units=[units]*num_leads, sig_name=lead_names,
+               wfdb.wrsamp(record, fs=sampling_frequency, units=[units]*num_leads, sig_name=lead_names,
                    d_signal=digital_signals, fmt=[fmt]*num_leads, adc_gain=[gain]*num_leads, baseline=[baseline]*num_leads,
                    write_dir=args.output_path, comments=comments)
 
-               if args.signal_format == 'mat':
+               if args.signal_format in ('mat', '.mat'):
                    convert_dat_to_mat(record, write_dir=args.output_path)
 
-               # Recompute the checksums for the checksum due to an error in the Python WFDB library.
+               # Recompute the checksums as needed.
                checksums = np.sum(digital_signals, axis=0, dtype=np.int16)
                fix_checksums(os.path.join(args.output_path, record), checksums)
 
 if __name__=='__main__':
-    run(get_parser().parse_args(sys.argv[1:]))
+    run(get_parser().parse_args(sys.argv[1:]))
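
The script recomputes the header checksums because, as the comments above note, the checksums written by the Python WFDB library can be wrong. A minimal sketch, not part of this commit, for confirming that a written record is self-consistent; the record path is a placeholder, and the record is assumed to have been produced by the script above.

    import numpy as np
    import wfdb

    # Placeholder: any record written by prepare_code15_data.py.
    record = wfdb.rdrecord('code15_output/1234567', physical=False)

    # A WFDB checksum is the 16-bit sum of a signal's digital samples.
    expected = np.sum(np.asarray(record.d_signal), axis=0, dtype=np.int16)
    assert all(int(a) == int(b) for a, b in zip(record.checksum, expected))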
