index command

The index command accepts two forms of input. In expanded reference mode, it takes a reference genome FASTA file and a GTF file, from which it builds either a spliced+intronic (splici) or a spliced+unspliced (spliceu) reference using roers (which simpleaf uses directly as a library, so it need not be installed independently). In direct-ref mode, it takes a single reference sequence file (i.e., a FASTA file) as input.

In expanded reference mode, once the expanded reference has been constructed, it is indexed with piscem build, and a copy of the 3-column transcript-to-gene file is placed in the index directory for subsequent use. The output directory will contain both a ref and an index subdirectory: the former holds the expanded (splici or spliceu) reference extracted from the provided genome and GTF, and the latter holds the index built on this reference.
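As a rough illustration of expanded reference mode, the sketch below assembles a simpleaf index command line from the options listed in the help text further down, and names the two subdirectories described above. The helper names (expanded_index_command, expected_subdirs), the file paths, and the read length of 91 are hypothetical and chosen only for the example; the command is printed rather than executed.

```python
import shlex
from pathlib import Path


def expanded_index_command(fasta, gtf, output, rlen,
                           ref_type="spliced+intronic", threads=16):
    """Build an argument list for `simpleaf index` in expanded reference mode.

    Only flags that appear in the `simpleaf index -h` output are used.
    """
    return [
        "simpleaf", "index",
        "--fasta", str(fasta),
        "--gtf", str(gtf),
        "--rlen", str(rlen),
        "--ref-type", ref_type,
        "--output", str(output),
        "--threads", str(threads),
    ]


def expected_subdirs(output):
    """Return the (ref, index) subdirectories the documentation describes."""
    out = Path(output)
    return out / "ref", out / "index"


# Hypothetical inputs, for illustration only.
cmd = expanded_index_command(
    fasta="genome.fa",
    gtf="genes.gtf",
    output="simpleaf_index",
    rlen=91,
    ref_type="spliced+unspliced",
)
print(shlex.join(cmd))

ref_dir, index_dir = expected_subdirs("simpleaf_index")
print(ref_dir.name, index_dir.name)
```

The list could be passed to subprocess.run(cmd, check=True) on a machine where simpleaf is installed and configured.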

In direct-ref mode, if --ref-seq is passed, the given FASTA file is passed to piscem build directly. If --probe-csv or --feature-csv is passed, a FASTA file is first created from the CSV and then passed to piscem build. The output directory will contain an index subdirectory that holds the index built on this reference.

  • --probe-csv: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of the 10x Probe Set Reference v2 CSV, containing four mandatory columns: gene_id, probe_seq, probe_id, and included (must be TRUE or FALSE), and an optional column: region (must be spliced or unspliced). When parsing the file, simpleaf uses only the rows where the included column is TRUE. For each such row, simpleaf first builds a FASTA record whose identifier is probe_id and whose sequence is probe_seq. It then builds a t2g file in which the first column is probe_id and the second column is gene_id. If the region column exists, the t2g file also includes the region information, which triggers USA mode in simpleaf quant so that spliced and unspliced counts are generated separately. The t2g file will be found by simpleaf quant automatically if --t2g-map is not set.

  • --feature-csv: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of the 10x Feature Reference CSV. Currently, only three columns are used: id, name, and sequence. When parsing the file, simpleaf first builds a FASTA file using the id and sequence columns. It then builds a t2g file in which the transcript is set to id and the gene is set to name. The t2g file will be found by simpleaf quant automatically if --t2g-map is not set.
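The two CSV conversions described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not simpleaf's actual implementation (simpleaf is written in Rust). Several details are assumptions: the function names are hypothetical, lines starting with # are treated as header comments and skipped, the t2g file is tab-separated, and the region is encoded as S or U in the third column. The example rows are invented.

```python
import csv
import io


def probe_csv_to_fasta_and_t2g(csv_text):
    """Return (fasta_text, t2g_text) for probe-set CSV content given as a string."""
    # Assumption: leading metadata lines start with '#' and are not data.
    data_lines = (ln for ln in io.StringIO(csv_text) if not ln.startswith("#"))
    fasta, t2g = [], []
    for row in csv.DictReader(data_lines):
        # Only rows whose `included` column is TRUE are used.
        if row["included"].strip().upper() != "TRUE":
            continue
        fasta.append(f">{row['probe_id']}\n{row['probe_seq']}")
        cols = [row["probe_id"], row["gene_id"]]
        region = row.get("region")  # None when the optional column is absent
        if region is not None:
            # Assumption: region is written as S (spliced) or U (unspliced).
            cols.append({"spliced": "S", "unspliced": "U"}[region.strip()])
        t2g.append("\t".join(cols))
    return "\n".join(fasta) + "\n", "\n".join(t2g) + "\n"


def feature_csv_to_fasta_and_t2g(csv_text):
    """Return (fasta_text, t2g_text) using only the id, name, sequence columns."""
    fasta, t2g = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fasta.append(f">{row['id']}\n{row['sequence']}")
        t2g.append(f"{row['id']}\t{row['name']}")  # transcript = id, gene = name
    return "\n".join(fasta) + "\n", "\n".join(t2g) + "\n"


# Invented example rows, for illustration only.
PROBES = """#probe_set_file_format=2.0
gene_id,probe_seq,probe_id,included,region
G1,ACGTACGT,G1|p1,TRUE,spliced
G1,TTTTCCCC,G1|p2,FALSE,spliced
G2,GGGGAAAA,G2|p1,TRUE,unspliced
"""
FEATURES = """id,name,read,pattern,sequence,feature_type
CD3,CD3_TotalSeqB,R2,5PNNNNNNNNNN(BC),AACAAGACCCTTGAG,Antibody Capture
"""

probe_fasta, probe_t2g = probe_csv_to_fasta_and_t2g(PROBES)
feature_fasta, feature_t2g = feature_csv_to_fasta_and_t2g(FEATURES)
print(probe_fasta, probe_t2g, feature_fasta, feature_t2g, sep="")
```

In the probe example, the row with included set to FALSE is dropped, so only two FASTA records and two t2g lines are produced.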

The relevant options (which you can obtain by running simpleaf index -h) are:

build the (expanded) reference index

Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>|--probe-csv <PROBE_CSV>|--feature-csv <FEATURE_CSV>>

Options:
-o, --output <OUTPUT>            Path to output directory (will be created if it doesn't exist)
-t, --threads <THREADS>          Number of threads to use when running [default: 16]
-k, --kmer-length <KMER_LENGTH>  The value of k to be used to construct the index [default: 31]
    --gff3-format                Denotes that the input annotation is a GFF3 (instead of GTF)
                                file
    --keep-duplicates            Keep duplicated identical sequences when constructing the index
    --overwrite                  Overwrite existing files if the output directory is already
                                populated
-h, --help                       Print help
-V, --version                    Print version

Expanded Reference Options:
    --ref-type <REF_TYPE>    Specify whether an expanded reference, spliced+intronic (or splici)
                            or spliced+unspliced (or spliceu), should be built [default:
                            spliced+intronic]
-f, --fasta <FASTA>          Path to a reference genome to be used for the expanded reference
                            construction
-g, --gtf <GTF>              Path to a reference GTF/GFF3 file to be used for the expanded
                            reference construction
-r, --rlen <RLEN>            The Read length used in roers to add flanking lengths to intronic
                            sequences
    --dedup                  Deduplicate identical sequences in roers when building the expanded
                            reference
    --spliced <SPLICED>      Path to FASTA file with extra spliced sequence to add to the index
    --unspliced <UNSPLICED>  Path to a FASTA file with extra unspliced sequence to add to the
                            index

Direct Reference Options:
    --feature-csv <FEATURE_CSV>  A CSV file containing feature barcode sequences to use for
                                direct reference indexing. The file must follow the format of
                                10x Feature Reference CSV. Currently, only three columns are
                                used: id, name, and sequence
    --probe-csv <PROBE_CSV>      A CSV file containing probe sequences to use for direct
                                reference indexing. The file must follow the format of 10x Probe
                                Set Reference v2 CSV, containing four mandatory columns:
                                gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and
                                an optional column: region (spliced or unspliced)
    --ref-seq <REF_SEQ>          A FASTA file containing reference sequences to directly build
                                index on, and avoid expanded reference construction

Piscem Index Options:
-m, --minimizer-length <MINIMIZER_LENGTH>
        Minimizer length to be used to construct the piscem index (must be < k) [default: 19]
    --decoy-paths <DECOY_PATHS>
        Paths to decoy sequence FASTA files used to insert poison k-mer information into the
        index (only if using piscem >= 0.7)
    --seed <HASH_SEED>
        The seed value to use in SSHash index construction (try changing this in the rare event
        index build fails) [default: 1]
    --work-dir <WORK_DIR>
        The working directory where temporary files should be placed [default: ./workdir.noindex]