Scripts, configuration files, and instructions to create a DNAnexus applet for the Python package stepcount
(https://github.com/OxWearables/stepcount) for use on the DNAnexus platform.
- Python 3.8 or higher
- The DNAnexus dx toolkit: pip install dxpy
- git
The following shows how to use Anaconda to satisfy the above prerequisites (you can use any Python environment manager):
- Download & install Miniconda (light-weight version of Anaconda).
- (Windows only) Open the Anaconda Prompt (Start Menu).
- Create a new environment named dxpy with Python, pip, and Git:
  conda create -n dxpy python=3.9 pip git
- Activate the environment:
  conda activate dxpy
  You should now see (dxpy) at the beginning of your command prompt.
- Install dxpy:
  pip install dxpy
🟢 You are now ready! You've created an environment called dxpy containing the DNAnexus CLI.
🔁 Next time: Just open the Anaconda Prompt and run:
conda activate dxpy
If you see (dxpy) in your prompt, you're good to go.
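To confirm that the environment satisfies the prerequisites listed earlier, a quick check along the following lines may help. This is a minimal sketch, not part of the repository: the helper name check_prerequisites is hypothetical, and it only verifies the Python version and that the dx and git executables can be found on the PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8), tools=("dx", "git")):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # sys.version_info compares like a tuple, e.g. (3, 9, 18, ...) >= (3, 8)
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required} or higher is required")
    # shutil.which returns None when the executable is not on the PATH
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Calling check_prerequisites() inside the activated dxpy environment should return an empty list once the setup steps have succeeded.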
Log in to DNAnexus:
dx login
Use your regular DNAnexus username/password.
Basic DNAnexus commands (prefixed with dx) mimic Unix commands:
Command | Meaning |
---|---|
dx ls | List files/folders |
dx cd | Change directories |
dx mkdir | Create a new folder |
dx rm | Delete a file |
dx mv | Move or rename a file |
📖 For more: DNAnexus CLI Quickstart
- Clone this repository:
  git clone https://github.com/OxWearables/dnanexus-stepcount.git
  cd dnanexus-stepcount/
- Build the asset:
  dx build_asset stepcount-asset
  ⏳ This takes 10–15 minutes and may show warnings, which you can ignore.
- When complete, copy the asset ID (e.g., record-abc123). If you missed it:
  dx describe stepcount-asset
- Open the file stepcount/dxapp.json and find this section:
  "assetDepends": [ { "id": "record-..." } ]
  Replace "record-..." with the actual asset ID. Save and close.
- Finally, build the applet:
  dx build stepcount
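The manual edit of dxapp.json described in the steps above could also be scripted. The following is a minimal sketch, not part of the repository: the helper names set_asset_id and update_dxapp are hypothetical, and the code assumes only that dxapp.json is valid JSON containing an "assetDepends" list somewhere in its structure, as in the fragment shown in the steps.

```python
import json
from pathlib import Path


def set_asset_id(node, asset_id):
    """Recursively set the "id" of every entry under any "assetDepends" key.

    `node` is modified in place. Returns the number of entries updated.
    """
    updated = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "assetDepends" and isinstance(value, list):
                for entry in value:
                    if isinstance(entry, dict) and "id" in entry:
                        entry["id"] = asset_id
                        updated += 1
            else:
                updated += set_asset_id(value, asset_id)
    elif isinstance(node, list):
        for item in node:
            updated += set_asset_id(item, asset_id)
    return updated


def update_dxapp(path, asset_id):
    """Rewrite the JSON file at `path` so its asset dependency uses `asset_id`."""
    path = Path(path)
    config = json.loads(path.read_text())
    if set_asset_id(config, asset_id) == 0:
        raise ValueError(f'no "assetDepends" entry found in {path}')
    # Note: json.dumps normalises the file's whitespace and indentation.
    path.write_text(json.dumps(config, indent=2) + "\n")
```

For example, update_dxapp("stepcount/dxapp.json", "record-abc123"), run from the repository root, would apply the placeholder ID used above; substitute the ID reported for your own asset.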
To begin, download a sample accelerometer file:
https://wearables-files.ndph.ox.ac.uk/files/data/samples/ax3/tiny-sample.cwa.gz
and upload it to your DNAnexus project:
dx upload tiny-sample.cwa.gz
You can now run the applet on the uploaded sample file:
dx run stepcount -iinput_file=tiny-sample.cwa.gz
⏳ This takes 5–10 minutes.
This starts a new job on DNAnexus.
The job ID shown in the output (e.g. job-AbCdE12345) can be used to track its progress in the DNAnexus web interface under the “Monitor” tab.
Once the job finishes, an outputs/ folder will be created in your project. You can view its contents with dx tree outputs/, which should look like this:
outputs/
└── tiny-sample
    ├── tiny-sample-Bouts.csv.gz
    ├── tiny-sample-Daily.csv.gz
    ├── tiny-sample-DailyAdjusted.csv.gz
    ├── tiny-sample-Hourly.csv.gz
    ├── tiny-sample-HourlyAdjusted.csv.gz
    ├── tiny-sample-Info.json
    ├── tiny-sample-Minutely.csv.gz
    ├── tiny-sample-MinutelyAdjusted.csv.gz
    ├── tiny-sample-Steps.csv.gz
    ├── tiny-sample-Steps.png
    └── tiny-sample-StepTimes.csv.gz
- Error: ('destination project is in region aws:xx-xxxx-x but "regionalOptions" do not contain this region. Please, update your "regionalOptions" specification',)
- Solution: Open stepcount/dxapp.json and search for the "regionalOptions" field:
  "regionalOptions": { "aws:eu-west-2": {...} }
  Change "aws:eu-west-2" to your project region as indicated in your error message.
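The region fix can be expressed as a small JSON edit. The sketch below is illustrative only: set_region and update_region are hypothetical helper names, and the code assumes that "regionalOptions" is a top-level key of dxapp.json holding exactly one region, as in the fragment above.

```python
import json
from pathlib import Path


def set_region(config, new_region):
    """Rename the single key under "regionalOptions" to `new_region`.

    Assumption: `config` is the parsed dxapp.json and "regionalOptions"
    is a top-level key with exactly one region entry.
    """
    options = config["regionalOptions"]
    if new_region in options:
        return config  # already configured for this region
    if len(options) != 1:
        raise ValueError("expected exactly one region in regionalOptions")
    (settings,) = options.values()  # keep the existing per-region settings
    config["regionalOptions"] = {new_region: settings}
    return config


def update_region(path, new_region):
    """Apply set_region to the JSON file at `path` and write it back."""
    path = Path(path)
    config = set_region(json.loads(path.read_text()), new_region)
    path.write_text(json.dumps(config, indent=2) + "\n")
```

Pass the region string exactly as it appears in your error message, for example update_region("stepcount/dxapp.json", "aws:us-east-1") if that is the region reported for your project.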
The most straightforward way to process multiple files is to submit one dx run command per file. The example below shows how to automate this using standard Unix commands (it also works in the Windows Anaconda Prompt).
First, you'll need to generate a list of the file paths you want to process. In this example, we're working with UK Biobank accelerometer data (about 100,000 files). We use the dx find data command to filter by field ID 90001 (the UK Biobank field ID for accelerometry), and then use awk to extract just the file paths:
dx find data --property field_id=90001 | awk '{print $6}' > my-files.txt
The resulting my-files.txt file should contain entries like:
/Bulk/Activity/Raw/54/5408734_90001_1_0.cwa
/Bulk/Activity/Raw/49/4945583_90001_1_0.cwa
/Bulk/Activity/Raw/20/2066665_90001_1_0.cwa
...
Finally, we use xargs to submit a job for each entry:
xargs -P10 -I {} sh -c 'dx run stepcount -iinput_file=":{}" -y --brief' < my-files.txt | tee my-jobs.txt
This will execute dx run stepcount ... for each entry in my-files.txt. It will also create a log file my-jobs.txt containing the list of submitted job IDs.
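For readers who prefer Python to xargs, the same submission loop can be sketched as follows. This is an illustration rather than the repository's method: the function names are hypothetical, the dx arguments are copied from the xargs command above, and jobs are submitted one at a time instead of ten in parallel. With dry_run=True (the default) the commands are only printed, so the script can be inspected before anything is submitted.

```python
import shlex
import subprocess


def build_run_command(file_path):
    """Return the `dx run` argument list for one DNAnexus file path.

    Mirrors the xargs command above, including the leading ":" before
    the path and the -y and --brief flags.
    """
    return ["dx", "run", "stepcount", f"-iinput_file=:{file_path}", "-y", "--brief"]


def read_paths(lines):
    """Yield non-empty, stripped paths from an iterable of lines."""
    for line in lines:
        path = line.strip()
        if path:
            yield path


def submit_all(list_file="my-files.txt", log_file="my-jobs.txt", dry_run=True):
    """Submit one job per listed path; with dry_run=True only print the commands."""
    with open(list_file) as src, open(log_file, "a") as log:
        for path in read_paths(src):
            cmd = build_run_command(path)
            if dry_run:
                print(shlex.join(cmd))
                continue
            # Requires the dx CLI on the PATH and an active `dx login` session.
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            job_id = result.stdout.strip()
            log.write(job_id + "\n")  # same role as `tee my-jobs.txt`
            print(job_id)
```

Running submit_all() first, then submit_all(dry_run=False) once the printed commands look right, keeps accidental mass submissions less likely.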
For additional batch processing strategies, see the tutorial by the UK Biobank team: https://github.com/UK-Biobank/UKB-RAP-Imaging-ML/blob/main/stepcount-applet/bulk_files_processing.ipynb
If you need to terminate multiple job submissions, the my-jobs.txt file can be used as follows:
xargs -P10 -I {} sh -c 'dx terminate "{}"' < my-jobs.txt
After running multiple jobs, you may want to merge their output files for further analysis. The stepcount package includes a secondary CLI tool, stepcount-collate-outputs, made for this purpose. To use it on DNAnexus, you'll need to create a separate applet (you can reuse the already created stepcount-asset asset, avoiding the time-consuming asset building process):
- Open stepcount-collate-outputs/dxapp.json and find this section:
  "assetDepends": [ { "id": "record-..." } ]
  Replace "record-..." with the ID of the asset you created earlier (i.e. stepcount-asset).
- Build the applet:
  dx build stepcount-collate-outputs
The applet can then be used as follows:
dx run stepcount-collate-outputs -iinput_file=my-outputs.txt
First, create the my-outputs.txt file listing the IDs of the files you want to collate. We will use dx find data for this. Assuming the files are in the outputs/ folder, run:
dx find data --path outputs/ --brief > my-outputs.txt
The resulting my-outputs.txt file will look like this:
project-GXJBY38JZ32Vb0588YVYx3Gy:file-Gx4k9hjJVz2Gb3gkV0p3XfVk
project-GXJBY38JZ32Vb0588YVYx3Gy:file-Gx4k9hjJVz28pPjj9p7vJqkX
project-GXJBY38JZ32Vb0588YVYx3Gy:file-Gx4k9hjJVz2P260x2PjZK0Gy
...
Note that, unlike the my-files.txt file from the previous section, which listed file paths, this one lists file IDs.
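Because the two list files use different formats, it can be worth sanity-checking my-outputs.txt before uploading it. The following minimal sketch assumes only the project-...:file-... line format shown above; parse_data_id and count_by_project are hypothetical helper names.

```python
from collections import Counter


def parse_data_id(line):
    """Split a "project-...:file-..." line into (project_id, file_id)."""
    project_id, sep, file_id = line.strip().partition(":")
    if not sep or not project_id.startswith("project-") or not file_id.startswith("file-"):
        raise ValueError(f"not a project:file ID: {line!r}")
    return project_id, file_id


def count_by_project(lines):
    """Count file IDs per project, skipping blank lines."""
    return Counter(parse_data_id(line)[0] for line in lines if line.strip())
```

For instance, passing an open my-outputs.txt file to count_by_project gives the number of listed files per project, and a file path accidentally taken from my-files.txt raises a ValueError instead of being passed on silently.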
Next, upload the list to DNAnexus:
dx upload my-outputs.txt
Finally, run the collate applet on the list:
dx run stepcount-collate-outputs -iinput_file=my-outputs.txt
If you're dealing with hundreds of thousands of output files (e.g. UK Biobank), collating everything may be too slow.
The stepcount package creates several output types. For example, *-Info.json files have overall stats, *-Daily.csv.gz files have daily summaries, and *-Hourly.csv.gz files show hourly data.
You can speed things up by selecting only the files you need using the --name option in the dx find data command. For example, if you only want the *-Info.json files:
dx find data --path outputs/ --brief --name "*-Info.json" > only-info-outputs.txt
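To get a feel for which files a pattern such as *-Info.json selects, the sketch below applies it to a few file names from the sample output tree shown earlier. It uses Python's fnmatch module as a stand-in for glob matching; this is an assumption for illustration, and dx's own pattern matching may differ in detail.

```python
from fnmatch import fnmatchcase

# File names taken from the sample outputs/ tree shown earlier.
names = [
    "tiny-sample-Daily.csv.gz",
    "tiny-sample-Hourly.csv.gz",
    "tiny-sample-Info.json",
    "tiny-sample-Steps.png",
]

# Keep only the names matching the glob pattern (case-sensitive).
info_only = [name for name in names if fnmatchcase(name, "*-Info.json")]

# A pattern must cover the full name, including the .gz suffix.
daily_only = [name for name in names if fnmatchcase(name, "*-Daily.csv.gz")]
```

Here info_only contains only tiny-sample-Info.json, and daily_only contains only tiny-sample-Daily.csv.gz; a pattern of *-Daily.csv would match nothing in this list because the files end in .csv.gz.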
To ensure reproducibility and follow best practices, we recommend explicitly pinning the version of the stepcount package in your asset.
- Open the stepcount-asset/dxasset.json file.
- Edit the "execDepends" section to include the desired version of stepcount. For example, to pin the version to 3.12.0, you would specify:
  "execDepends": [ {"name": "stepcount", "version": "3.12.0", "package_manager": "pip"}, {...} ]
- Save and close the file.
- Rebuild the asset by running:
  dx build_asset stepcount-asset
Find all available versions of stepcount here:
👉 https://github.com/OxWearables/stepcount/releases
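The version-pinning edit can likewise be sketched in Python. This is an illustration under stated assumptions, not the repository's tooling: pin_package is a hypothetical helper, and it assumes "execDepends" is a top-level list in the parsed dxasset.json with entries shaped like the example above.

```python
def pin_package(config, name, version, package_manager="pip"):
    """Set the "version" of `name` in config["execDepends"].

    If no entry with this name and package manager exists, one is appended.
    `config` is the parsed dxasset.json (a dict) and is modified in place.
    """
    deps = config.setdefault("execDepends", [])
    for dep in deps:
        if dep.get("name") == name and dep.get("package_manager") == package_manager:
            dep["version"] = version
            return config
    deps.append({"name": name, "version": version, "package_manager": package_manager})
    return config
```

A typical use would load stepcount-asset/dxasset.json with json.load, call pin_package(config, "stepcount", "3.12.0"), write the result back with json.dump, and then rebuild the asset.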