Reference Construction

This section describes how the reference set (reference sequences and haplotyping positions) can be generated.

Note

This section needs extension and more explanation.

Input

The overall input is a TSV file, seeds_accession.tsv, that lists one GenBank accession with the prototype sequence for each haplotype and region.

species              haplotype  region   accession   source

Ca. L. solanacearum  A          16S      FJ498802.1  .
Ca. L. solanacearum  B          16S      FJ939136.1  .
Ca. L. solanacearum  C          16S      GU373048.1  .
Ca. L. solanacearum  D          16S      HQ454302.1  .
Ca. L. solanacearum  E          16S      KF737348.1  .

Ca. L. solanacearum  A          16S-23S  FJ830690.1  .
Ca. L. solanacearum  B          16S-23S  FJ830700.1  .
Ca. L. solanacearum  C          16S-23S  JX280523.1  .
Ca. L. solanacearum  D          16S-23S  JX308304.1  .
Ca. L. solanacearum  E          16S-23S  KF737347.1  .

Ca. L. solanacearum  A          50S      EU834131.1  .
Ca. L. solanacearum  B          50S      FJ498807.1  .
Ca. L. solanacearum  C          50S      GU373051.1  .
Ca. L. solanacearum  D          50S      HQ454317.1  .
Ca. L. solanacearum  E          50S      KY777461.1  .
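The table layout above can be checked before any download is started. The following is a minimal sketch, assuming the file is tab-separated with exactly the header names shown above; the helper name load_seeds is hypothetical and not part of hlso.

```python
import csv
import io
from collections import defaultdict


def load_seeds(handle):
    """Group accessions by region and haplotype (hypothetical helper, not part of hlso)."""
    seeds = defaultdict(dict)
    # csv.DictReader skips completely blank lines, so blank separator
    # lines between the region groups are tolerated.
    for row in csv.DictReader(handle, delimiter="\t"):
        region, haplotype = row["region"], row["haplotype"]
        if haplotype in seeds[region]:
            raise ValueError(f"duplicate seed for {region}/{haplotype}")
        seeds[region][haplotype] = row["accession"]
    return dict(seeds)


# Three rows taken from the table above, written with explicit tabs.
example = (
    "species\thaplotype\tregion\taccession\tsource\n"
    "\n"
    "Ca. L. solanacearum\tA\t16S\tFJ498802.1\t.\n"
    "Ca. L. solanacearum\tB\t16S\tFJ939136.1\t.\n"
    "Ca. L. solanacearum\tA\t50S\tEU834131.1\t.\n"
)
seeds = load_seeds(io.StringIO(example))
print(seeds["16S"]["B"])  # prints FJ939136.1
```

Because the species name contains spaces, the columns have to be separated by real tab characters; a file aligned with spaces only would not parse this way.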

Download Seed Sequences

$ hlso cli ref_download \
    path/to/seeds_accession.tsv \
    path/to/seeds_paths.tsv

This downloads each sequence by accession and stores it next to the seeds_accession.tsv file. It also writes the file seeds_paths.tsv, which lists the names of the downloaded files:

species              haplotype  region   accession   path
Ca. L. solanacearum  A          16S      FJ498802.1  FJ498802.1.fasta
Ca. L. solanacearum  B          16S      FJ939136.1  FJ939136.1.fasta
Ca. L. solanacearum  C          16S      GU373048.1  GU373048.1.fasta
Ca. L. solanacearum  D          16S      HQ454302.1  HQ454302.1.fasta
Ca. L. solanacearum  E          16S      KF737348.1  KF737348.1.fasta
Ca. L. solanacearum  A          16S-23S  FJ830690.1  FJ830690.1.fasta
Ca. L. solanacearum  B          16S-23S  FJ830700.1  FJ830700.1.fasta
Ca. L. solanacearum  C          16S-23S  JX280523.1  JX280523.1.fasta
Ca. L. solanacearum  D          16S-23S  JX308304.1  JX308304.1.fasta
Ca. L. solanacearum  E          16S-23S  KF737347.1  KF737347.1.fasta
Ca. L. solanacearum  A          50S      EU834131.1  EU834131.1.fasta
Ca. L. solanacearum  B          50S      FJ498807.1  FJ498807.1.fasta
Ca. L. solanacearum  C          50S      GU373051.1  GU373051.1.fasta
Ca. L. solanacearum  D          50S      HQ454317.1  HQ454317.1.fasta
Ca. L. solanacearum  E          50S      KY777461.1  KY777461.1.fasta
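To illustrate what a download by accession involves, here is a minimal sketch that fetches a FASTA record through the NCBI E-utilities efetch endpoint and names the file accession.fasta next to the input table. This is an assumption about one possible approach, not a description of how ref_download is implemented; the function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an efetch URL that returns the nucleotide record as FASTA text."""
    query = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(query)}"


def fasta_path(accessions_tsv, accession):
    """Place <accession>.fasta next to the accessions table, as in the listing above."""
    return Path(accessions_tsv).parent / f"{accession}.fasta"


def download_seed(accessions_tsv, accession):
    """Fetch one record and write it to disk (performs a network request)."""
    target = fasta_path(accessions_tsv, accession)
    with urlopen(efetch_url(accession)) as response:
        target.write_bytes(response.read())
    return target


# No network access here: only show where the record would come from and go to.
print(efetch_url("FJ498802.1"))
print(fasta_path("path/to/seeds_accession.tsv", "FJ498802.1").name)  # prints FJ498802.1.fasta
```

Calling download_seed("path/to/seeds_accession.tsv", "FJ498802.1") would perform the actual request.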

Performing Seed BLAST Queries

The next step is to perform a BLAST search via NCBI WWWBLAST to obtain sequences similar to the seeds.

$ hlso ref_blast path/to/seeds_paths.tsv

For each seed query accession.fasta, a file accession.blast.xml will be generated with the BLAST results.
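The resulting XML files can be inspected with the Python standard library. The sketch below assumes the files use the standard NCBI BLAST XML output, with Hit, Hit_accession, Hsp_identity and Hsp_align-len elements; the function name summarize_hits and the sample record, including its placeholder accessions, are made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarize_hits(xml_text, min_identity=0.0):
    """Return (accession, identity fraction) for the first HSP of every hit."""
    hits = []
    for hit in ET.fromstring(xml_text).iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")
        if hsp is None:
            continue  # hit without alignment details
        identity = int(hsp.findtext("Hsp_identity")) / int(hsp.findtext("Hsp_align-len"))
        if identity >= min_identity:
            hits.append((hit.findtext("Hit_accession"), identity))
    return hits


# Synthetic, heavily trimmed record; the accessions are placeholders.
sample = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_accession>XX000001</Hit_accession><Hit_hsps><Hsp>
  <Hsp_identity>98</Hsp_identity><Hsp_align-len>100</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
<Hit><Hit_accession>XX000002</Hit_accession><Hit_hsps><Hsp>
  <Hsp_identity>80</Hsp_identity><Hsp_align-len>100</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

for accession, identity in summarize_hits(sample, min_identity=0.9):
    print(accession, f"{identity:.2f}")  # prints XX000001 0.98
```

For a real file, pass the contents of accession.blast.xml as xml_text.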

Consensus and Table Creation

Finally, compute consensus sequences and the haplotyping table.

$ hlso ref_consensus path/to/seeds_paths.tsv \
    --output-table haplotype_table.txt

This will perform a consensus computation of the seeds, generate a haplotype-specific sequence for each region and each haplotype, and create a haplotyping table.
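The consensus algorithm and the table format are not specified here. As a rough illustration of the idea only, the sketch below takes a column-wise majority vote over already aligned sequences and then lists the positions at which the per-haplotype sequences differ. Both functions and the toy sequences are hypothetical and make no claim to match what ref_consensus does.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Column-wise majority vote over equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != gap)
        if counts:  # columns consisting only of gaps are dropped
            # Ties go to the base seen first in the column.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def variable_positions(by_haplotype):
    """1-based positions where the aligned haplotype sequences disagree."""
    names = list(by_haplotype)
    table = []
    for pos, column in enumerate(zip(*by_haplotype.values()), start=1):
        if len(set(column)) > 1:
            table.append((pos, dict(zip(names, column))))
    return table


# Toy sequences, not real data.
print(majority_consensus(["ACGT-A", "ACGTTA", "ACCTTA"]))  # prints ACGTTA
for pos, bases in variable_positions({"A": "ACGTTA", "B": "ACCTTA", "C": "ACGTTG"}):
    print(pos, bases)
```

In this toy example, positions 3 and 6 are the only ones that distinguish the three haplotypes, which is the kind of information a haplotyping table needs to record.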

The file haplotype_table.txt can then be used for the haplotyping of sequences themselves.