``index`` command ================= The ``index`` command has two forms of input; either it will take a reference genome FASTA and GTF as input, from which it can build a spliced+intronic (splici) reference or a spliced+unspliced (spliceu) reference using `roers `_ (which is used as a library directly from ``simpleaf``, and so need not be installed independently), or it will take a single reference sequence file (i.e. FASTA file) as input (direct-ref mode). In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build``, and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectory, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference. In direct-ref mode, if ``--refseq`` is passed, the provided FASTA file will be provided to ``piscem build`` directly. If ``probe_csv`` or ``feature_csv`` is passed, a FASTA file will be created accordingly and provided to ``piscem build``. The output directory will contain an ``index`` subdirectory that contains the index built on this reference. - ``probe_csv``: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of `10x Probe Set Reference CSV `_, containing four mandatory columns: `gene_id`, `probe_seq`, `probe_id`, and `included` (must be ``TRUE`` or ``FALSE``), and an optional column: `region` (must be ``spliced`` or ``unspliced``). When parsing the file, ``simpleaf`` will only use the rows where the `included` column is ``TRUE``. For each row, ``simpleaf`` first builds a FASTA record where the identifier is set as `probe_id`, and the sequence is set as `probe_seq`. Then, it will build a t2g file where the first column is `probe_id` and the second column is `gene_id`. If the `region` column exists, the t2g file will include the region information, so as to trigger the USA mode in ``simpleaf quant`` to generate spliced and unspliced count separately. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set. - ``feature_csv``: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of `10x Feature Reference CSV `_. Currently, only three columns are used: `id`, `name`, and `sequence`. When parsing the file, ``simpleaf`` first builds a FASTA file using the `id` and `sequence` columns. Then, it will build a t2g file where the transcript is set as `id` and the gene is set as `name`. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set. The relevant options (which you can obtain by running ``simpleaf index -h``) are: .. code-block:: console build the (expanded) reference index Usage: simpleaf index [OPTIONS] --output <--fasta |--ref-seq |--probe-csv |--feature-csv > Options: -o, --output Path to output directory (will be created if it doesn't exist) -t, --threads Number of threads to use when running [default: 16] -k, --kmer-length The value of k to be used to construct the index [default: 31] --gff3-format Denotes that the input annotation is a GFF3 (instead of GTF) file --keep-duplicates Keep duplicated identical sequences when constructing the index --overwrite Overwrite existing files if the output directory is already populated -h, --help Print help -V, --version Print version Expanded Reference Options: --ref-type Specify whether an expanded reference, spliced+intronic (or splici) or spliced+unspliced (or spliceu), should be built [default: spliced+intronic] -f, --fasta Path to a reference genome to be used for the expanded reference construction -g, --gtf Path to a reference GTF/GFF3 file to be used for the expanded reference construction -r, --rlen The Read length used in roers to add flanking lengths to intronic sequences --dedup Deduplicate identical sequences in roers when building the expanded reference --spliced Path to FASTA file with extra spliced sequence to add to the index --unspliced Path to a FASTA file with extra unspliced sequence to add to the index Direct Reference Options: --feature-csv A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of 10x Feature Reference CSV. Currently, only three columns are used: id, name, and sequence --probe-csv A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of 10x Probe Set Reference v2 CSV, containing four mandatory columns: gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and an optional column: region (spliced or unspliced) --ref-seq A FASTA file containing reference sequences to directly build index on, and avoid expanded reference construction Piscem Index Options: -m, --minimizer-length Minimizer length to be used to construct the piscem index (must be < k) [default: 19] --decoy-paths Paths to decoy sequence FASTA files used to insert poison k-mer information into the index (only if using piscem >= 0.7) --seed The seed value to use in SSHash index construction (try changing this in the rare event index build fails) [default: 1] --work-dir The working directory where temporary files should be placed [default: ./workdir.noindex]