index command
The index command accepts one of two forms of input. In expanded reference mode, it takes a reference genome FASTA and a GTF file, from which it builds either a spliced+intronic (splici) reference or a spliced+unspliced (spliceu) reference using roers (which simpleaf uses directly as a library, so it need not be installed independently). In direct-ref mode, it takes a single reference sequence file (i.e., a FASTA file) as input.
In expanded reference mode, once the expanded reference has been constructed, it is indexed with piscem build, and a copy of the 3-column transcript-to-gene file is placed in the index directory for subsequent use. The output directory will contain both a ref and an index subdirectory: the former holds the expanded (splici or spliceu) reference extracted from the provided genome and GTF, and the latter holds the index built on that reference.
In direct-ref mode, if --ref-seq is passed, the given FASTA file is handed to piscem build directly. If --probe-csv or --feature-csv is passed, a FASTA file is first created from the CSV and then handed to piscem build. The output directory will contain an index subdirectory that holds the index built on this reference.
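As a rough illustration of how the two input modes map onto command-line flags, the following Python sketch assembles (but does not run) a simpleaf index argument list. The helper name simpleaf_index_argv and all file paths are hypothetical; only flags that appear in the help text on this page are used, and the check that --fasta comes with --gtf follows the description of expanded reference mode above rather than any inspection of simpleaf itself.

```python
import shlex


def simpleaf_index_argv(output, *, fasta=None, gtf=None, ref_seq=None,
                        probe_csv=None, feature_csv=None,
                        ref_type=None, threads=None):
    """Assemble (but do not run) a `simpleaf index` argument list."""
    # Exactly one input source selects the mode, mirroring the usage line.
    inputs = {"--fasta": fasta, "--ref-seq": ref_seq,
              "--probe-csv": probe_csv, "--feature-csv": feature_csv}
    chosen = [(flag, path) for flag, path in inputs.items() if path is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of fasta, ref_seq, probe_csv, feature_csv")
    flag, path = chosen[0]
    argv = ["simpleaf", "index", "--output", output, flag, path]
    if flag == "--fasta":
        # Expanded reference mode: genome FASTA plus annotation.
        if gtf is None:
            raise ValueError("expanded reference mode needs an annotation (gtf)")
        argv += ["--gtf", gtf]
        if ref_type is not None:
            argv += ["--ref-type", ref_type]
    elif gtf is not None or ref_type is not None:
        raise ValueError("gtf and ref_type only apply to expanded reference mode")
    if threads is not None:
        argv += ["--threads", str(threads)]
    return argv


# Hypothetical paths, for illustration only.
expanded = simpleaf_index_argv("idx_out", fasta="genome.fa", gtf="genes.gtf",
                               ref_type="spliced+unspliced", threads=8)
direct = simpleaf_index_argv("probe_idx", probe_csv="probes.csv")
print(shlex.join(expanded))
print(shlex.join(direct))
```

The resulting list could be passed to subprocess.run on a machine where simpleaf is installed and configured; that step is left out here so the sketch stays self-contained.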
probe_csv: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of 10x Probe Set Reference CSV, containing four mandatory columns: gene_id, probe_seq, probe_id, and included (must be TRUE or FALSE), and an optional column: region (must be spliced or unspliced). When parsing the file, simpleaf will only use the rows where the included column is TRUE. For each row, simpleaf first builds a FASTA record where the identifier is set as probe_id, and the sequence is set as probe_seq. Then, it will build a t2g file where the first column is probe_id and the second column is gene_id. If the region column exists, the t2g file will include the region information, so as to trigger the USA mode in simpleaf quant and generate spliced and unspliced counts separately. The t2g file will be identified by simpleaf quant automatically if --t2g-map is not set.

feature_csv: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of 10x Feature Reference CSV. Currently, only three columns are used: id, name, and sequence. When parsing the file, simpleaf first builds a FASTA file using the id and sequence columns. Then, it will build a t2g file where the transcript is set as id and the gene is set as name. The t2g file will be identified by simpleaf quant automatically if --t2g-map is not set.
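The two CSV conversions described above can be sketched in Python as follows. This is a minimal illustration and not simpleaf's own implementation (which is written in Rust). Several details are assumptions: the function names are hypothetical, lines starting with '#' are treated as metadata and skipped, the t2g file is written tab-separated, and the region is encoded as S or U in the third column by analogy with the 3-column transcript-to-gene convention; the files simpleaf actually writes may differ.

```python
import csv


def probe_csv_to_fasta_and_t2g(csv_text):
    """Return (fasta_text, t2g_text) built from probe set CSV text."""
    # Assumption: metadata lines starting with '#' precede the header row.
    lines = [ln for ln in csv_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(lines)
    has_region = "region" in (reader.fieldnames or [])
    # Assumption: S/U labels in the third column; simpleaf may encode this differently.
    region_code = {"spliced": "S", "unspliced": "U"}
    fasta, t2g = [], []
    for row in reader:
        # Only rows whose `included` column is TRUE are used.
        if row["included"].strip().upper() != "TRUE":
            continue
        fasta.append(f">{row['probe_id']}\n{row['probe_seq']}\n")
        cols = [row["probe_id"], row["gene_id"]]
        if has_region:
            cols.append(region_code[row["region"].strip().lower()])
        t2g.append("\t".join(cols) + "\n")
    return "".join(fasta), "".join(t2g)


def feature_csv_to_fasta_and_t2g(csv_text):
    """Return (fasta_text, t2g_text) using only the id, name, and sequence columns."""
    fasta, t2g = [], []
    for row in csv.DictReader(csv_text.splitlines()):
        fasta.append(f">{row['id']}\n{row['sequence']}\n")
        t2g.append(f"{row['id']}\t{row['name']}\n")  # transcript = id, gene = name
    return "".join(fasta), "".join(t2g)


# Made-up probe records, for illustration only.
probe_example = (
    "#probe_set_file_format=2.0\n"
    "gene_id,probe_seq,probe_id,included,region\n"
    "G1,ACGTACGT,G1|p1,TRUE,spliced\n"
    "G1,TTTTCCCC,G1|p2,FALSE,spliced\n"
    "G2,GGGGAAAA,G2|p1,TRUE,unspliced\n"
)
fasta_text, t2g_text = probe_csv_to_fasta_and_t2g(probe_example)
print(fasta_text, end="")
print(t2g_text, end="")
```

In this example the second probe is dropped because its included value is FALSE, and the presence of the region column adds a third t2g column, which is the information simpleaf quant relies on to produce separate spliced and unspliced counts.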
The relevant options (which you can obtain by running simpleaf index -h) are:
build the (expanded) reference index
Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>|--probe-csv <PROBE_CSV>|--feature-csv <FEATURE_CSV>>
Options:
  -o, --output <OUTPUT>            Path to output directory (will be created if it doesn't exist)
  -t, --threads <THREADS>          Number of threads to use when running [default: 16]
  -k, --kmer-length <KMER_LENGTH>  The value of k to be used to construct the index [default: 31]
      --gff3-format                Denotes that the input annotation is a GFF3 (instead of GTF)
                                   file
      --keep-duplicates            Keep duplicated identical sequences when constructing the index
      --overwrite                  Overwrite existing files if the output directory is already
                                   populated
  -h, --help                       Print help
  -V, --version                    Print version
Expanded Reference Options:
      --ref-type <REF_TYPE>    Specify whether an expanded reference, spliced+intronic (or splici)
                               or spliced+unspliced (or spliceu), should be built [default:
                               spliced+intronic]
  -f, --fasta <FASTA>          Path to a reference genome to be used for the expanded reference
                               construction
  -g, --gtf <GTF>              Path to a reference GTF/GFF3 file to be used for the expanded
                               reference construction
  -r, --rlen <RLEN>            The Read length used in roers to add flanking lengths to intronic
                               sequences
      --dedup                  Deduplicate identical sequences in roers when building the expanded
                               reference
      --spliced <SPLICED>      Path to FASTA file with extra spliced sequence to add to the index
      --unspliced <UNSPLICED>  Path to a FASTA file with extra unspliced sequence to add to the
                               index
Direct Reference Options:
      --feature-csv <FEATURE_CSV>  A CSV file containing feature barcode sequences to use for
                                   direct reference indexing. The file must follow the format of
                                   10x Feature Reference CSV. Currently, only three columns are
                                   used: id, name, and sequence
      --probe-csv <PROBE_CSV>      A CSV file containing probe sequences to use for direct
                                   reference indexing. The file must follow the format of 10x Probe
                                   Set Reference v2 CSV, containing four mandatory columns:
                                   gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and
                                   an optional column: region (spliced or unspliced)
      --ref-seq <REF_SEQ>          A FASTA file containing reference sequences to directly build
                                   index on, and avoid expanded reference construction
Piscem Index Options:
  -m, --minimizer-length <MINIMIZER_LENGTH>
          Minimizer length to be used to construct the piscem index (must be < k) [default: 19]
      --decoy-paths <DECOY_PATHS>
          Paths to decoy sequence FASTA files used to insert poison k-mer information into the
          index (only if using piscem >= 0.7)
      --seed <HASH_SEED>
          The seed value to use in SSHash index construction (try changing this in the rare event
          index build fails) [default: 1]
      --work-dir <WORK_DIR>
          The working directory where temporary files should be placed [default: ./workdir.noindex]