Simpleaf workflow utility library#

To ease the development of simpleaf workflow templates, the simpleaf team provides not only some built-in variables, but also a workflow utility library, which will be automatically passed to the internal Jsonnet engine of simpleaf when parsing a workflow template as the __utils external variable. One can receive this variable in their templates by adding utils=std.extVar("__utils"), and use the functions in the utility library by calling utils.function_name(args), where function_name should be replaced by an actual function name listed below. To be consistent with the Jsonnet official documentation, here we will list the function signatures with a brief description. One can find the function definitions on this page.

Import the utility library#

As the built-in variables are provided by simpleaf to its internal Jsonnet engine, they will be unavailable if we want to parse the template directly using Jsonnet or Jrsonnet. Therefore, when debugging templates that utilize the utility library with an external Jsonnet engine, we must manually provide the __utils external variable to Jsonnet and either copy and paste the library file to the same directory as the template file, which is the default library searching path when calling jsonnet, or provide the directory containing the library as an additional library searching path. Although in the following code chunk, we show the code for both ways, we only need to select one in practice. Here we assume that simpleaf has been configured correctly, i.e., the ALEVIN_FRY_HOME environment variable has been set and a local copy of the protocol-estuary exists. If not, one can directly obtain the library file from its GitHub repository.

# If we haven't set up simpleaf, we pull the file directly from github
wget https://github.com/COMBINE-lab/protocol-estuary/blob/main/utils/simpleaf_workflow_utils.libsonnet

# Otherwise, we either copy the library file to the same dir as the template
copy $ALEVIN_FRY_HOME/protocol-estuary/protocol-estuary-main/utils/simpleaf_workflow_utils.libsonnet .

# or provide the directory as an additional library searching path via --jpath
jsonnet a_template_using_utils_lib.jsonnet --ext-code '__utils=import "simpleaf_workflow_utils.libsonnet"' --jpath "$ALEVIN_FRY_HOME/protocol-estuary/protocol-estuary-main/utils"

where --ext-code is the flag for passing an external variable, and --jpath specifies the library searching path.

Although simpleaf automatically provides the utility library as the external variable __utils, we must receive this external variable in our template before starting using the functions provided in this library.

To do this, we recommend adding the following code at the beginning of your workflow template.

local utils=std.extVar("__utils");

utils.ref_type(o)#

Input:

  • o: an object with
    • a type field and

    • an object field with the name specified by the type field. Other fields will be ignored. For example, {type: "spliceu", spliceu: {gtf: "genes.gtf", fasta: "genome.fa", "field_being_ignored": "ignore me"}}.

Output:

  • An object with the simpleaf index arguments that are related to the specified reference type in the input object.

This function has four modes (reference types), triggered by the type field in the input object. When specifying a mode, the input object must contain an object field named by that mode and contain the required fields. Otherwise, an error will be raised. The four modes are:

  • spliceu (spliced+unspliced reference): The required fields are:
    • gtf: A string representing the path to a gene annotation GTF file.

    • fasta: A string representing the path to a reference genome FASTA file.

  • splici (spliced+intronic reference): The required fields are:
    • gtf: A string representing the path to a gene annotation GTF file.

    • fasta: A string representing the path to a reference genome FASTA file.

    • rlen: An optional field representing the read length in the dataset. If not provided, the default value, 91, will be used.

  • direct_ref: The required fields are:
    • ref_seq: A string representing the path to a transcriptome FASTA file.

    • t2g_map: A string representing the path to a transcript-to-gene mapping file.

  • existing_index: The required fields are:
    • index: A string representing the path to an existing index directory.

    • t2g_map: A string representing the path to a transcript-to-gene mapping file.

Wrapper functions: We also provide separate functions for each of the four modes, utils.splici, utils.spliceu, utils.direct_ref, and utils.existing_index, which are thin wrappers of utils.ref_type. These four functions take an object containing their required fields introduced above.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local splici_args = {
    gtf : "genes.gtf",
    fasta : "genome.fa",
    rlen : 91
};

local ref_type = utils.ref_type({
    type : "splici",
    splici : splici_args
});

local splici = utils.splici(splici_args);

In the above example, the objects ref_type and splici are identical and look like the following:

{
    # hidden, system fields
    type :: "splici", # hidden field
    arguments :: {gtf : "genes.gtf", fasta : "genome.fa", rlen : 91}, # hidden field

    # fields shown in the manifest
    "--ref-type" : "splici",
    "--fasta" : "genome.fa",
    "--gtf" : "genes.gtf",
    "--rlen" : 91
}

utils.simpleaf_index(step, ref_type, arguments, output)#

Input:

  • step: An integer indicating the step number (execution order) of this simpleaf command record in the workflow.

  • ref_type: A ref_type object returned by calling utils.ref_type or any object with the same format.

  • arguments: An object in which each field represents a simpleaf index argument. Furthermore, there must be a field called active representing the active state of this simpleaf index command.

  • output: A string representing the output directory of the simpleaf index command.

Output:

  • A well-defined simpleaf index command record.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local splici_args = {
    gtf : "genes.gtf",
    fasta : "genome.fa",
    rlen : 91
};

local splici = utils.splici(splici_args);

local arguments = {
    active : true,
    "--use-piscem" : true
};

local simpleaf_index = utils.simpleaf_index(
    1, # step number
    splici, # ref_type,
    arguments,
    "./simpleaf_index" # output directory
);

The simpleaf_index object in the above code chunk will be

{
    # hidden, system fields
    ref_type :: {}, # hidden field. The actual contents are omitted. see above example code for function `ref_type`
    arguments :: {active : true, "--use-piscem" : true},  # hidden field
    output :: "./simpleaf_index", # hidden field
    index :: "./simpleaf_index/index", # hidden field
    t2g_map :: "./simpleaf_index/index/t2g_3col.tsv", # hidden field

    # fields shown in in the manifest
    program_name : "simpleaf index",
    step : 1,
    active : true,
    "--output": "./workflow_output/simpleaf_index",
    "--gtf" : "genes.gtf",
    "--fasta" : "genome.fa",
    "--rlen" : 91,
    "--use-piscem" : true
}

utils.map_type(o, simpleaf_index = {})#

Input:

  • o: an object with
    • a type field, and

    • an object field with the name specified by the type field. Other fields will be ignored. For example, {"type": "map_reads", "map_reads": {"reads1": null, "reads2": null}, "field_being_ignored": "ignore me"}.

  • simpleaf_index: An empty object if in existing_mappings mode, or the output object of the simpleaf_index function if in map_reads mode. The default value is an empty object.

Output:

  • An object with the simpleaf quant arguments that are related to the specified map type in the input object.

This function has two modes (map types), triggered by the type field in the input object. When specifying a mode, the input object must contain an object field named by that mode and contain the required fields. Otherwise, an error will be raised. The two modes are:

  • map_reads: Map reads against the provided index or an index built from a previous step. The required fields are
    • reads1: A string representing the path to a gene annotation GTF file,

    • reads2: A string representing the path to a reference genome FASTA file.

  • existing_mappings: Skip mapping and use the existing mapping results. The required fields are
    • map_dir: A string representing the path to the mapping result directory,

    • t2g_map: A string representing the path to a transcript-to-gene mapping file.

Wrapper functions: We also provide separate functions for each of the two modes, utils.map_reads and utils.existing_mappings, which are thin wrappers of utils.map_type. These two functions take an object containing their required fields introduced above.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local simpleaf_index = {}; # The return of object of simpleaf_index function in its example usage

local map_reads_args = {
    reads1 : "reads1.fastq",
    reads2 : "reads2.fastq"
};

local map_type = utils.map_type({
    type : "map_reads",
    map_reads : map_reads_args
});

local map_reads = utils.map_reads(map_reads_args);

In the above example, the objects map_type and map_reads are identical and look like the following:

{
    # hidden, system fields
    type :: "map_reads", # hidden field
    arguments :: {reads1 : "reads1.fastq", reads2 : "reads2.fastq"}, # hidden field

    # fields shown in the manifest
    "--index" : "./workflow_output/simpleaf_index/index",
    "--t2g-map": "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
    "--reads1" : "reads1.fastq",
    "--reads2" : "reads2.fastq"
}

utils.cell_filt_type(o)#

Input:

  • o: an object with
    • a type field, and

    • an argument field with the name specified by the type field. Other fields will be ignored. For example, {"type": "explicit_pl", "explicit_pl": "whitelist.txt", "field_being_ignored": "ignore me"}

Output:

  • An object with the simpleaf quant arguments that are related to the specified cell filtering type in the input object.

This function has five modes (cell filtering types), triggered by the type field in the input object. When specifying a mode, the input object must contain an object field named by that mode and contain the required fields. Otherwise, an error will be raised. For more details, please refer to the online documentation of simpleaf quant and alevin-fry. The five modes are:

  • unfiltered_pl: No cell filtering but correcting cell barcodes by an external or default (only works for 10X Chromium V2 and V3). The corresponding argument value field can be true (using the default whitelist if in 10xv2 and 10xv3 chemistry), or a string representing the path to an unfiltered permit list file.

  • knee: Knee point-based filtering. The corresponding argument value field must be true if selected.

  • forced: Use a forced number of cells. The corresponding argument field must be an integer representing the number of cells that can pass the filtering.

  • expect: Use the expected number of cells. The corresponding argument field must be an integer representing the expected number of cells.

  • explicit_pl: Use a filtered, explicit permit list. The corresponding argument field must be a string representing the path to a cell barcode permit list file.

Wrapper functions: We also provide a separate function for each mode, utils.unfiltered_pl, utils.knee, utils.forced, utils.expect, and utils.explicit_pl, which are thin wrappers of utils.cell_filt_type. These functions take an object containing their required fields introduced above.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local unfiltered_pl_args = {
    unfiltered_pl : true
};

local cell_filt_type = utils.cell_filt_type({
    type : "unfiltered_pl",
    unfiltered_pl : unfiltered_pl_args
});

local unfiltered_pl = utils.unfiltered_pl(unfiltered_pl_args);

In the above example, the objects cell_filt_type and unfiltered_pl are identical and look like the following:

{
    # hidden, system fields
    type :: "unfiltered_pl", # hidden field
    arguments :: true, # hidden field

    # fields shown in the manifest
    "--unfiltered-pl" : true
}

simpleaf_quant(step, map_type, cell_filt_type, arguments, output)#

Input:

  • step : An integer indicating the step number (execution order) of this simpleaf command record in the workflow.

  • map_type : A map_type object returned by calling utils.map_type or any object with the same format.

  • cell_filt_type : A cell_filt_type object returned by calling utils.cell_filt_type or any object with the same format.

  • arguments : an object in which each field represents a simpleaf quant argument. Furthermore, there must be a field called active representing the active state of this simpleaf index command.

  • output : A string representing the output directory of this simpleaf quant command.

Output:

  • A well-defined simpleaf quant command record.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local arguments = {
    active : true,
    "--chemistry" : "10xv3",
    "--resolution" : "cr-like"
};

local simpleaf_quant = utils.simpleaf_quant(
    2, # step number
    map_type, # defined in the example usage of function `map_reads`
    cell_filt_type, # defined in the example usage of function `cell_filt_type`
    arguments,
    "./simpleaf_quant" # output directory
);

The simpleaf_quant object in the above code chunk will be

{
    # hidden, system fields
    map_type :: {}, # hidden field. The actual contents are omitted. see above example code for function `map_reads`
    cell_filt_type :: {}, # hidden field. The actual contents are omitted. see above example code for function `cell_filt_type`
    arguments :: {active : true, "--chemistry" : "10xv3", "--resolution" : "cr-like"},  # hidden field
    output :: "./simpleaf_quant", # hidden field

    # fields shown in in the manifest
    program_name : "simpleaf index",
    step : 1,
    active : true,
    "--chemistry": "10xv3",
    "--index": "./workflow_output/simpleaf_index/index",
    "--min-reads": 10,
    "--output": "./workflow_output/simpleaf_quant",
    "--reads1": "reads1.fastq",
    "--reads2": "reads2.fastq",
    "--resolution": "cr-like",
    "--t2g-map": "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
    "--unfiltered-pl": true
}

feature_barcode_ref(start_step, csv, name_col, barcode_col, output)#

Input:

  • start_step: An integer indicating the starting step number (execution order) of the series of command records in the workflow. This function will define three command records with incremental step numbers according to the provided step number.

  • csv: A string representing the path to the “feature_barcode.csv” file of the dataset.

  • name_col: An integer representing the column index of the feature name column in the feature barcode CSV file.

  • barcode_col: An integer representing the column index of the feature barcode sequence column in the feature barcode CSV file.

  • output: A string representing the parent output directory of the result files. It will be created if it doesn’t exist.

Output:

  • An object containing three external command records, including mkdir, create_t2g, and create_fasta, and a hidden object that follows the output format of utils.ref_type shown above. This ref_type object is of the direct_ref type. It can be used as the second argument of utils.simpleaf_index. In this ref_type object,

This function defines three external command records:

  1. mkdir: This command calls the mkdir shell program to create the output directory recursively if it doesn’t exist.

  2. create_t2g: This command calls awk to create a transcript-to-gene mapping TSV file according to the input csv file, in which the transcript ID and gene ID of each feature barcode are identical. The expected output file of this command will be named “.feature_barcode_ref_t2g.tsv” and located in the provided output directory.

  3. create_fasta: This command calls awk to create a FASTA file according to the input csv file, in which each feature barcode is a FASTA record. The expected output file of this command will be named “.feature_barcode_ref.fa” and located in the provided output directory.

Please note that the start_step argument represents the starting step of the series of external commands. If start_step is set to 1, then mkdir will be assigned step 1, create_t2g will be assigned step 2, and so on. Therefore, the step of any future command after the utils.feature_barcode_ref commands should not be less than 4.

Example Usage

# import the utility library
local utils=std.extVar("__utils");

local feature_barcode_ref = utils.feature_barcode_ref(
    1, # start step number
    "feature_barcode.csv", # feature barcode csv
    1, # name_column
    5, # barcode column
    "feature_barcode_ref" # output path
)

The resulting object will look like the following:

{
    # hidden, system fields
    step :: 1,
    last_step :: 3,
    csv :: "feature_barcode.csv",
    output :: "./feature_barcode_ref",
    ref_seq :: "./feature_barcode_ref/.feature_barcode_ref.fa",
    t2g_map :: "./feature_barcode_ref/.feature_barcode_ref_t2g.tsv",

    # external command records
    mkdir : {
        active : true,
        step: step,
        program_name: "mkdir",
        arguments: ["-p", "./feature_barcode_ref"]
    },
    create_t2g : {
        active : true,
        step: step + 1,
        program_name: "awk",
        arguments: ["-F","','","'NR>1 {sub(/ /,\"_\",$1);print $1\"\\t\"$1}'", csv, ">", "./feature_barcode_ref/.feature_barcode_ref_t2g.tsv"]
    },

    create_fasta : {
        active : true,
        step: step + 2,
        program_name: "awk",
        arguments: ["-F","','","'NR>1 {sub(/ /,\"_\",$1);print \">\"$1\"\\n\"$5}'", csv, ">", "./feature_barcode_ref/.feature_barcode_ref.fa"]
    },
    ref_type :: {
        type : "direct_ref",
        t2g_map :: "./feature_barcode_ref/.feature_barcode_ref_t2g.tsv",
        "--ref-seq" : "./feature_barcode_ref/.feature_barcode_ref.fa"
    }
}

barcode_translation(start_step, url, quant_cb, output)#

Input:

  • start_step: An integer indicating the starting step number (execution order) of the series of command records in the workflow. This function will define five command records with incremental step numbers according to the provided step number.

  • url: A string representing the downloadable URL to the barcode mapping file. You can use this URL for 10xv3 data.

  • quant_cb: A string representing the path to the cell barcode file. Usually, this is at af_quant/alevin/quants_mat_rows.txt in the simpleaf quant command output directory.

  • output: A string representing the parent output directory of the result files. It will be created if it doesn’t exist.

Output:

  • An object containing five external command records, including mkdir, fetch_cb_translation_file, unzip_cb_translation_file, backup_bc_file, and barcode_translation.

This function defines five external command records:

  1. mkdir: This command calls the mkdir shell program to create the output directory recursively if it doesn’t exist.

  2. fetch_cb_translation_file: This command calls wget to fetch the barcode mapping file. The expected output file of this command will be called “.barcode.txt.gz”, located in the provided output directory.

  3. unzip_cb_translation_file: This command calls gunzip to decompress the barcode mapping file. The expected output file of this command will be called “.barcode.txt”, located in the provided output directory.

  4. backup_bc_file: This command calls mv to rename the provided barcode file. The expected output file of this command will have the same path as the provided barcode file but with a .bkp suffix.

  5. barcode_translation: This command calls awk to convert the barcodes in the provided barcode file according to the barcode translation file. The expected output file will be put at the provided quant_cb path.

Notice that the start_step argument represents the starting step of the series of external commands. If start_step is set to 1, then mkdir will be assigned as step 1, fetch_cb_translation_file will be assigned step 2, and so on. Therefore, the step of any future command after the barcode_translation commands should not be less than 6.

Example Usage

# import the utility library
local utils=std.extVar("__utils");
local url = "https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz";
local quant_cb = "simpeaf_quant/af_quant/alevin/quants_mat_rows.txt";

local barcode_translation = utils.barcode_translation(
    1, # start step number
    url,
    quant_cb,
    "simpeaf_quant/af_quant/alevin" # output path
)

The resulting object will look like the following:

{
    step :: 1,
    last_step :: 5,
    url :: "https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz",
    quant_cb :: "simpeaf_quant/af_quant/alevin/quants_mat_rows.txt",
    output :: "simpeaf_quant/af_quant/alevin",
    mkdir : {
        active : true,
        step : step,
        program_name : "mkdir",
        arguments : ["-p", "simpeaf_quant/af_quant/alevin"]
    },

    fetch_cb_translation_file : {
        active : true,
        step : step + 1,
        program_name : "wget",
        arguments : ["-O", "simpeaf_quant/af_quant/alevin/.barcode.txt.gz", "https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz"]
    },

    unzip_cb_translation_file : {
        active : true,
        step : step + 2,
        "program_name" : "gunzip",
        "arguments": ["-c", "simpeaf_quant/af_quant/alevin/.barcode.txt.gz", ">", "simpeaf_quant/af_quant/alevin/.barcode.txt"]
    },

    backup_bc_file : {
        active : true,
        step: step + 3,
        program_name: "mv",
        arguments: ["simpeaf_quant/af_quant/alevin/quants_mat_rows.txt", "simpeaf_quant/af_quant/alevin/quants_mat_rows.txt.bkp"]
    },

    // Translate RNA barcode to feature barcode
    barcode_translation : {
        active : true,
        step: step + 4,
        program_name: "awk",
        arguments: ["'FNR==NR {dict[$1]=$2; next} {$1=($1 in dict) ? dict[$1] : $1}1'", "simpeaf_quant/af_quant/alevin/.barcode.txt", "simpeaf_quant/af_quant/alevin/quants_mat_rows.txt.bkp", ">", "simpeaf_quant/af_quant/alevin/quants_mat_rows.txt"]
    },
}

utils.get(o, f, use_default = false, default = null)#

Input:

  • o: an object,

  • f: the target field name,

  • use_default: boolean,

  • default: a default value returned if the target field doesn’t exist.

Output:

  • Return the target field f in the given object if the object has a sub-field called f. Otherwise,

  • if use_default is true, return the value of the default argument.

  • if use_default is false, raise an error.

This function tries to (non-recursively) get a sub-field in the provided object and return it. If the field doesn’t exist, then it either returns a default value or raises an error.

Example Usage

local utils = std.extVar("__utils");

local splici_args = {
    gtf : "genes.gtf",
    fasta : "genome.fa",
    rlen : 91
};

{
    default_behavior : utils.get(splici_args, "gtf"), # this will return "genes.gtf",

    not_exist : utils.get(splici_args, "I do not exist"), # raise error

    provide_default : utils.get(splici_args, "I do not exist", true, "but I have a default value") # this yields "but I have a default value"

}