Update JUMP profile URLs to new structure #30

Open
@shntnu

This repository may contain references to JUMP profile data that need to be updated to reflect the new directory structure.

Context

The JUMP Cell Painting profiles have been reorganized to a new, cleaner structure. See jump-cellpainting/datasets#155 for details.

Required Changes

Your repository may contain references to the old profile paths that need to be updated:

Old → New Path Mappings

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet → /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier.parquet → /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet → /workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier.parquet → /workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet → /workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int_featselect_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int.parquet → /workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet → /workspace/profiles_assembled/ALL/v1.0b/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect.parquet → /workspace/profiles_assembled/ALL/v1.0b/profiles_wellpos_cc_var_mad_outlier_featselect.parquet
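All eight mappings follow one rule: the recipe directory and the pipeline-named directory are dropped, and a version directory is inserted after the dataset name (v1.0a for ORF and CRISPR, v1.0b for ALL, v1.0 for COMPOUND). As a rough sketch of that rule, the shell function below rewrites paths read from stdin without touching any files, which can be handy for previewing a change. The function name map_profile_path is hypothetical and not part of any JUMP tooling, and the sketch assumes exactly one directory level between the dataset name and the .parquet file, as in the list above.

```shell
# Hypothetical helper (not part of JUMP tooling): rewrite old profile
# paths arriving on stdin and print the result on stdout.
map_profile_path() {
    # Assumption: old paths look like <recipe>/<DATASET>/<pipeline_dir>/<file>.parquet
    old='/workspace/profiles/jump-profiling-recipe_2024_[a-z0-9]{7}'
    file='[^/]+/([^/]+[.]parquet)'
    new='/workspace/profiles_assembled'
    # The most specific datasets come first; once a path is rewritten it
    # no longer matches the old prefix, so later rules leave it alone.
    sed -E \
        -e "s#${old}/(ORF|CRISPR)/${file}#${new}/\\1/v1.0a/\\2#g" \
        -e "s#${old}/(ALL)/${file}#${new}/\\1/v1.0b/\\2#g" \
        -e "s#${old}/([A-Z]+)/${file}#${new}/\\1/v1.0/\\2#g"
}

# Example: preview the new location of one COMPOUND file.
printf '%s\n' '/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int.parquet' | map_profile_path
# prints /workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int.parquet
```

Because it only uses sed -E, this preview runs with the stock tools on both Linux and macOS.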

Update Script

The following AWK script by @afermg handles all profile paths generically, so the mappings above do not have to be applied one by one:

Create a file named update_cpg_location.awk:

# Update the paths of cpg files
# /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
# Is converted to
# /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

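BEGIN {
    # [^/]+ keeps each match inside a single path, even when a line holds several.
    pattern = "/workspace/profiles/jump-profiling-recipe_2024_[a-z0-9]{7}/([A-Z]+)/[^/]+/([^/]+[.]parquet)";
}

{
    out = "";
    rest = $0;
    # Rewrite every old path on the line, each with its own version suffix.
    while (match(rest, pattern, captures)) {
        version_name = "v1.0";
        if (captures[1] == "ORF" || captures[1] == "CRISPR") {
            version_name = version_name "a";
        }
        if (captures[1] == "ALL") {
            version_name = version_name "b";
        }
        replacement = "/workspace/profiles_assembled/" captures[1] "/" version_name "/" captures[2];
        # Keep the text before the match, append the new path, continue after it.
        out = out substr(rest, 1, RSTART - 1) replacement;
        rest = substr(rest, RSTART + RLENGTH);
    }
    print out rest
}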

To update all relevant files in your codebase:

# Find and update all files containing old profile paths
rg "workspace/profiles/jump-profiling-recipe_2024" -t py -t json -t md -t sh -t org -t csv -t nix -l | xargs awk -i inplace -f update_cpg_location.awk

Note for macOS users: The script relies on GNU awk extensions (the three-argument match() and -i inplace), which the stock macOS awk does not support. Install GNU awk with brew install gawk and use gawk instead of awk in the command above.
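To find out which binary on a given machine is GNU awk before running the update, something like the following sketch may help. The function name pick_gawk is hypothetical, and the check assumes that GNU awk identifies itself with "GNU Awk" on the first line of its --version output.

```shell
# Hypothetical helper: print the name of the first GNU awk found on PATH.
pick_gawk() {
    for candidate in gawk awk; do
        # stdin is redirected so an awk that does not know --version
        # cannot sit waiting for input.
        if command -v "$candidate" >/dev/null 2>&1 &&
           "$candidate" --version </dev/null 2>/dev/null | head -n 1 | grep -q 'GNU Awk'; then
            echo "$candidate"
            return 0
        fi
    done
    echo "GNU awk not found; install gawk first" >&2
    return 1
}
```

The printed name can then be used in place of awk in the update command.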

This command:

  • Uses ripgrep (rg) to find files containing the old paths
  • -t restricts the search to the listed file types
  • -l prints only the names of matching files
  • awk -i inplace modifies files in place

Important: After running the AWK script, always review the changes with git diff to ensure the transformations were applied correctly. The script handles most cases, but edge cases or typos in the original paths may require manual adjustment.
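Alongside git diff, a search for leftover references can confirm that nothing was missed. The sketch below uses grep rather than ripgrep so that it runs without extra tools; the function name check_old_paths is hypothetical. It skips the .git directory and update_cpg_location.awk itself, since the script's own comments and pattern contain the old path.

```shell
# Hypothetical check: report lines that still reference the old layout.
# Usage: check_old_paths [directory]   (defaults to the current directory)
check_old_paths() {
    # grep exits 0 when it finds a match, so a match means "not clean".
    if grep -rn --exclude-dir=.git --exclude='update_cpg_location.awk' \
            'workspace/profiles/jump-profiling-recipe_2024' "${1:-.}"; then
        echo "old profile paths remain" >&2
        return 1
    fi
    echo "no old profile paths found"
}
```

A non-zero exit status means some file types were not covered by the -t filters or a path did not match the pattern, and those lines need a manual fix.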

Additional Note

If your repository also references manifests/profile_index.csv, note that the format has changed from CSV to JSON. See jump-cellpainting/datasets#152 and jump-cellpainting/datasets#155 for details.

Action Required

Please update your code to use the new profile paths. The old paths will be deprecated.

Feel free to reach out if you have any questions or need assistance with the migration.
