fast_conformation.msa_generation package

Submodules

fast_conformation.msa_generation.colabfold module

fast_conformation.msa_generation.colabfold.chain_break(idx_res, Ls, length=200)

Adds a large number to residue indices to indicate chain breaks in a sequence.

Parameters:
  • idx_res (ndarray) – The array of residue indices.

  • Ls (list of int) – The lengths of different segments in the sequence.

  • length (int) – The value to add to the residue index at chain breaks.

Returns:

The updated array of residue indices with chain breaks.

Return type:

ndarray
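The offsetting described above can be illustrated with a small NumPy sketch. This is a stand-in written for this reference, not the package's own code: the name chain_break_sketch is hypothetical, and it assumes that every index after a segment boundary is shifted by length.

```python
import numpy as np


def chain_break_sketch(idx_res, Ls, length=200):
    """Hypothetical stand-in: shift indices after each segment boundary."""
    idx_res = np.array(idx_res, copy=True)  # work on a copy of the input
    boundary = 0
    for segment_length in Ls[:-1]:  # the last segment has no break after it
        boundary += segment_length
        idx_res[boundary:] += length  # everything past the break moves by `length`
    return idx_res


# Two segments of lengths 2 and 3 in a 5-residue sequence.
shifted = chain_break_sketch(np.arange(5), Ls=[2, 3], length=200)
print(shifted)
```

Under these assumptions the gap between the last index of one segment and the first index of the next grows by length, which is how downstream code can recognize a chain break.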

fast_conformation.msa_generation.colabfold.get_hash(x)

Generates a SHA-1 hash for a given string.

Parameters:

x (str) – The input string to be hashed.

Returns:

The SHA-1 hash of the input string.

Return type:

str
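A SHA-1 hash of a string can be computed with the standard library as sketched below. The name get_hash_sketch is hypothetical, and both the UTF-8 encoding and the hexadecimal output are assumptions; the text above does not specify either.

```python
import hashlib


def get_hash_sketch(x: str) -> str:
    """Hypothetical stand-in: hex-encoded SHA-1 digest of a string."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha1(x.encode("utf-8")).hexdigest()


digest = get_hash_sketch("MKTAYIAKQR")
print(digest)
```

A hex-encoded SHA-1 digest is always 40 characters long, which makes it convenient as a stable suffix for job or directory names.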

fast_conformation.msa_generation.colabfold.homooligomerize(msas, deletion_matrices, homooligomer=1)

Homooligomerizes the input MSAs (Multiple Sequence Alignments) and deletion matrices.

Parameters:
  • msas (list of lists) – A list of MSAs.

  • deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.

  • homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple
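One plausible reading of homooligomerization is block-diagonal padding: each copy of the MSA is placed in its own window of the concatenated sequence and padded with gaps elsewhere. The sketch below follows that reading, which mirrors the ColabFold helper of the same name; whether this package does exactly the same is an assumption, and homooligomerize_sketch is a hypothetical name.

```python
def homooligomerize_sketch(msas, deletion_matrices, homooligomer=1):
    """Hypothetical stand-in: pad each MSA copy into its own window."""
    if homooligomer == 1:
        return msas, deletion_matrices  # nothing to do for a monomer
    new_msas, new_mtxs = [], []
    for copy in range(homooligomer):
        for msa, mtx in zip(msas, deletion_matrices):
            num_res = len(msa[0])
            left = num_res * copy  # gap columns before this copy
            right = num_res * (homooligomer - copy - 1)  # gap columns after it
            new_msas.append(["-" * left + seq + "-" * right for seq in msa])
            new_mtxs.append([[0] * left + row + [0] * right for row in mtx])
    return new_msas, new_mtxs


msas = [["AB", "A-"]]
deletion_matrices = [[[0, 0], [0, 1]]]
dimer_msas, dimer_mtxs = homooligomerize_sketch(msas, deletion_matrices, homooligomer=2)
print(dimer_msas)
```

With homooligomer=2, every row doubles in width and the number of MSA blocks doubles, one block per copy.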

fast_conformation.msa_generation.colabfold.homooligomerize_heterooligomer(msas, deletion_matrices, lengths, homooligomers)

Homooligomerizes the input MSAs and deletion matrices for heterooligomeric complexes.

Parameters:
  • msas (list of lists) – A list of MSAs.

  • deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.

  • lengths (list of int) – A list of lengths for each component in the complex.

  • homooligomers (list of int) – A list of homooligomeric copies for each component.

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple

fast_conformation.msa_generation.colabfold.homooliomerize(msas, deletion_matrices, homooligomer=1)

Homooligomerizes the input MSAs and deletion matrices. This is a misspelled alias of homooligomerize, kept for cross-compatibility with code that calls the misspelled name.

Parameters:
  • msas (list of lists) – A list of MSAs.

  • deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.

  • homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple

fast_conformation.msa_generation.colabfold.plot_confidence(plddt, pae=None, Ls=None, dpi=100)

Plots predicted confidence metrics (pLDDT and PAE) for a protein structure.

Parameters:
  • plddt (ndarray) – Array of predicted Local Distance Difference Test (pLDDT) scores.

  • pae (ndarray) – Array of Predicted Aligned Error (PAE) scores (optional).

  • Ls (list of int) – The lengths of different segments in the sequence (optional).

  • dpi (int) – Dots per inch setting for the plot.

Returns:

The plot object displaying the confidence metrics.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_msas(msas, ori_seq=None, sort_by_seqid=True, deduplicate=True, dpi=100, return_plt=True)

Plots Multiple Sequence Alignments (MSAs).

Parameters:
  • msas (list of lists) – A list of MSAs to be plotted.

  • ori_seq (str) – The original sequence (optional).

  • sort_by_seqid (bool) – Whether to sort sequences by sequence identity (default: True).

  • deduplicate (bool) – Whether to remove duplicate sequences (default: True).

  • dpi (int) – Dots per inch setting for the plot.

  • return_plt (bool) – Whether to return the plot object (default: True).

Returns:

The plot object displaying the MSAs, if return_plt is True.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_plddt_legend(dpi=100)

Plots a legend for pLDDT (predicted Local Distance Difference Test) scores.

Parameters:

dpi (int) – Dots per inch setting for the plot.

Returns:

The plot object with the pLDDT legend.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_ticks(Ls)

Plots tick marks indicating segment boundaries on a plot.

Parameters:

Ls (list of int) – The lengths of different segments in the sequence.

fast_conformation.msa_generation.get_msa_jackhmmer module

fast_conformation.msa_generation.get_msa_jackhmmer.prep_inputs(sequence, jobname='test', homooligomer='1', output_dir=None, clean=False, verbose=True)

Prepares the input sequence and parameters for MSA generation.

Parameters:
  • sequence (str) – The protein sequence to be processed.

  • jobname (str) – The name of the job. Default is “test”.

  • homooligomer (str) – A string specifying the number of homooligomers for each sequence segment. Default is “1”.

  • output_dir (str) – The directory where output files will be saved. If None, a default directory is created based on the jobname and sequence hash.

  • clean (bool) – If True, cleans the output directory by removing existing files. Default is False.

  • verbose (bool) – If True, prints warnings and information during execution. Default is True.

Returns:

A dictionary containing the processed inputs, including sequences, homooligomer information, and output directory.

Return type:

dict

fast_conformation.msa_generation.get_msa_jackhmmer.prep_msa(I, msa_method='mmseqs2', add_custom_msa=False, msa_format='fas', pair_mode='unpaired', pair_cov=50, pair_qid=20, hhfilter_loc='hhfilter', reformat_loc='reformat.pl', TMP_DIR='tmp', custom_msa=None, precomputed=None, mmseqs_host_url='https://a3m.mmseqs.com', verbose=True, use_ramdisk=False)

Prepares and processes MSAs for the given sequences using the specified MSA method.

Parameters:
  • I (dict) – Dictionary containing input sequences and other parameters.

  • msa_method (str) – Method used to generate MSAs. Default is “mmseqs2”.

  • add_custom_msa (bool) – Whether to add a custom MSA. Default is False.

  • msa_format (str) – The format of the MSA file. Default is “fas”.

  • pair_mode (str) – Pairing mode for sequences. Can be “unpaired”, “paired”, or “unpaired+paired”. Default is “unpaired”.

  • pair_cov (int) – Coverage threshold for pairing sequences. Default is 50.

  • pair_qid (int) – Identity threshold for pairing sequences. Default is 20.

  • hhfilter_loc (str) – Path to the hhfilter binary. Default is “hhfilter”.

  • reformat_loc (str) – Path to the reformat.pl script. Default is “reformat.pl”.

  • TMP_DIR (str) – Path to the temporary directory. Default is “tmp”.

  • custom_msa (str) – Path to a custom MSA file (optional).

  • precomputed (str) – Path to a precomputed MSA file (optional).

  • mmseqs_host_url (str) – URL of the MMseqs2 server. Default is “https://a3m.mmseqs.com”.

  • verbose (bool) – If True, prints progress and information during execution. Default is True.

  • use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.

Returns:

The updated dictionary I containing the generated MSAs and deletion matrices.

Return type:

dict

fast_conformation.msa_generation.get_msa_jackhmmer.run_jackhmmer(sequence, prefix, jackhmmer_binary_path='jackhmmer', verbose=True, use_ramdisk=False)

Runs the jackhmmer tool to search for homologous sequences in a protein sequence database.

Parameters:
  • sequence (str) – The query protein sequence.

  • prefix (str) – The prefix for output files.

  • jackhmmer_binary_path (str) – Path to the jackhmmer binary executable. Default is ‘jackhmmer’.

  • verbose (bool) – If True, prints progress and information during execution. Default is True.

  • use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.

Returns:

A tuple containing the MSAs, deletion matrices, and sequence names.

Return type:

tuple

fast_conformation.msa_generation.jackhmmer module

Library to run Jackhmmer from Python.

class fast_conformation.msa_generation.jackhmmer.Jackhmmer(*, binary_path: str, database_path: str, use_ramdisk: bool = False, n_cpu: int = 8, n_iter: int = 1, e_value: float = 0.0001, z_value: int | None = None, get_tblout: bool = False, filter_f1: float = 5e-09, filter_f2: float = 5e-13, filter_f3: float = 5e-15, incdom_e: float | None = None, dom_e: float | None = None, num_streamed_chunks: int | None = None, streaming_callback: Callable[[int], None] | None = None)

Bases: object

Python wrapper of the Jackhmmer binary.

query(input_fasta_path: str) → Sequence[Mapping[str, Any]]

Queries the database using Jackhmmer.

fast_conformation.msa_generation.msa_utils module

Common utilities for data pipeline tools.

fast_conformation.msa_generation.msa_utils.create_directory(path)

fast_conformation.msa_generation.msa_utils.create_ram_disk()

fast_conformation.msa_generation.msa_utils.read_fasta(file_path)

fast_conformation.msa_generation.msa_utils.save_dict_to_fasta(seq_dict, output_path, jobname)

fast_conformation.msa_generation.msa_utils.timing(msg: str)

fast_conformation.msa_generation.msa_utils.tmpdir_manager(base_dir: str | None = None)

Context manager that deletes a temporary directory on exit.
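A context manager with this behavior can be built from the standard library, as in the sketch below. It illustrates the pattern only: tmpdir_manager_sketch is a hypothetical name, and the use of tempfile.mkdtemp and shutil.rmtree is an assumption about how such a helper might be written.

```python
import contextlib
import os
import shutil
import tempfile
from typing import Optional


@contextlib.contextmanager
def tmpdir_manager_sketch(base_dir: Optional[str] = None):
    """Hypothetical stand-in: yield a fresh directory, delete it on exit."""
    tmpdir = tempfile.mkdtemp(dir=base_dir)
    try:
        yield tmpdir
    finally:
        # Runs on normal exit and on exceptions alike.
        shutil.rmtree(tmpdir, ignore_errors=True)


with tmpdir_manager_sketch() as workdir:
    scratch = os.path.join(workdir, "query.fasta")
    with open(scratch, "w") as handle:
        handle.write(">query\nMKT\n")
    existed_inside = os.path.exists(scratch)

removed_after = not os.path.exists(workdir)
print(existed_inside, removed_after)
```

Putting the cleanup in a finally clause means intermediate files do not accumulate when a search step fails partway through.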

fast_conformation.msa_generation.pairmsa module

fast_conformation.msa_generation.pairmsa.get_uni_jackhmmer(msa, mtx, lab, filter_qid=0.15, filter_cov=0.5)

Filters sequences to retain only UniProt entries from a multiple sequence alignment (MSA).

Parameters:
  • msa (list of str) – List of sequences in the MSA.

  • mtx (list of list of int) – List of deletion matrices corresponding to the MSA.

  • lab (list of str) – List of labels corresponding to the MSA.

  • filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.

  • filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.

Returns:

Filtered (msa, mtx, lab) where each is a list.

Return type:

tuple

fast_conformation.msa_generation.pairmsa.hash_it(_seq, _lab, _mtx, call_uniprot=False)

Generates a hash for a given sequence and label.

Parameters:
  • _seq (list of str) – List of sequences.

  • _lab (list of str) – List of labels corresponding to the sequences.

  • _mtx (list of list of int) – List of deletion matrices corresponding to the sequences.

  • call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.

Returns:

A dictionary containing mappings of sequences, labels, and hashes.

Return type:

dict

fast_conformation.msa_generation.pairmsa.map_retrieve(ids, call_uniprot=False)

Maps UniRef IDs to UniProt accession numbers.

Parameters:
  • ids (list of str) – List of UniRef IDs.

  • call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.

Returns:

Mapping from UniRef IDs to UniProt accession numbers.

Return type:

dict

fast_conformation.msa_generation.pairmsa.parse_a3m(a3m_lines=None, a3m_file=None, filter_qid=0.15, filter_cov=0.5, N=100000)

Parses an A3M file or lines and filters sequences based on sequence identity and coverage.

Parameters:
  • a3m_lines (list of str) – Lines from an A3M file (optional).

  • a3m_file (str) – Path to an A3M file (optional).

  • filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.

  • filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.

  • N (int) – Maximum number of sequences to retain. Default is 100000.

Returns:

(sequences, deletion_matrices, names) where each is a list.

Return type:

tuple
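The identity and coverage filters can be pictured with the sketch below. It is an illustration under stated assumptions, not the package's code: passes_filters is a hypothetical helper, both fractions are taken relative to the query length, and the thresholds are treated as inclusive because the parameters are described as minimums.

```python
def passes_filters(query, hit, filter_qid=0.15, filter_cov=0.5):
    """Hypothetical helper: query and hit are aligned rows of equal length."""
    ref_len = len(query)
    # Identity: aligned positions where the hit matches the query residue.
    identical = sum(q == h and h != "-" for q, h in zip(query, hit))
    # Coverage: positions where the hit has a residue rather than a gap.
    covered = sum(h != "-" for h in hit)
    return identical / ref_len >= filter_qid and covered / ref_len >= filter_cov


query = "ACDEFGHIKL"
rows = ["ACDEFG----", "ACD-------", "WWWWWWWWWW"]
kept = [row for row in rows if passes_filters(query, row)]
print(kept)
```

Here the second row is dropped for low coverage and the third for low identity, so a row has to clear both thresholds to be retained.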

fast_conformation.msa_generation.pairmsa.stitch(_hash_a, _hash_b, stitch_min=1, stitch_max=20, filter_id=None)

Stitches two hashed sequences together based on their alignment.

Parameters:
  • _hash_a (dict) – First sequence hash information.

  • _hash_b (dict) – Second sequence hash information.

  • stitch_min (int) – Minimum allowed distance between aligned sequences. Default is 1.

  • stitch_max (int) – Maximum allowed distance between aligned sequences. Default is 20.

  • filter_id (None) – Placeholder for a potential filtering ID (not used).

Returns:

(sequences, deletion matrices) for the stitched sequences.

Return type:

tuple

fast_conformation.msa_generation.pairmsa.uni_num(ids)

Converts UniProt IDs to numerical representations.

Parameters:

ids (list of str) – List of UniProt IDs.

Returns:

Numerical representations of the UniProt IDs.

Return type:

list of int

fast_conformation.msa_generation.parsers module

Functions for parsing various file formats.

class fast_conformation.msa_generation.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])

Bases: object

Class representing a template hit.

aligned_cols: int
hit_sequence: str
index: int
indices_hit: List[int]
indices_query: List[int]
name: str
query: str
sum_probs: float

fast_conformation.msa_generation.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None) → str

Converts MSA in Stockholm format to the A3M format.

fast_conformation.msa_generation.parsers.parse_a3m(a3m_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]]]

Parses sequences and deletion matrix from an A3M-format alignment.

Parameters:

a3m_string – The string contents of an A3M file. The first sequence in the file should be the query sequence.

Returns:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.

Return type:

Tuple[Sequence[str], Sequence[Sequence[int]]]
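The relationship between the aligned sequences and the deletion matrix can be seen in the sketch below. It assumes the usual A3M convention that lowercase letters are residues absent from the query, so they are counted and then stripped from the aligned row. The name parse_a3m_sketch and the sample input are hypothetical.

```python
import string


def parse_a3m_sketch(a3m_string):
    """Hypothetical stand-in: return (aligned sequences, deletion matrix)."""
    sequences = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")  # a header starts a new record
        elif sequences:
            sequences[-1] += line

    deletion_matrix = []
    for sequence in sequences:
        row, pending = [], 0
        for char in sequence:
            if char.islower():
                pending += 1  # lowercase residue: counted, not aligned
            else:
                row.append(pending)  # count belongs to the next aligned column
                pending = 0
        deletion_matrix.append(row)

    drop_lowercase = str.maketrans("", "", string.ascii_lowercase)
    aligned = [sequence.translate(drop_lowercase) for sequence in sequences]
    return aligned, deletion_matrix


example = ">query\nMKT\n>hit1\nMaaK-\n"
aligned, deletions = parse_a3m_sketch(example)
print(aligned, deletions)
```

In the sample, the two lowercase residues in hit1 are removed from the aligned row and recorded as a count of 2 at the column that follows them, so every row ends up with the same length as the query.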

fast_conformation.msa_generation.parsers.parse_e_values_from_tblout(tblout: str) → Dict[str, float]

Parses the target-to-e-value mapping from a Jackhmmer tblout string.
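A minimal sketch of such a parser is shown below. It assumes HMMER's tabular per-target format, in which comment lines start with # and the whitespace-separated fields put the target name first and the full-sequence E-value fifth. The name parse_e_values_sketch and the sample rows are made up for illustration.

```python
def parse_e_values_sketch(tblout):
    """Hypothetical stand-in: map each target name to its E-value."""
    e_values = {}
    for line in tblout.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        # Assumption: field 0 is the target name, field 4 the E-value.
        e_values[fields[0]] = float(fields[4])
    return e_values


# Made-up rows in the assumed column layout.
sample = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "targetA        -          query       -          1.2e-30  105.3  0.1\n"
    "targetB        -          query       -          3.4e-05   22.0  0.0\n"
)
e_values = parse_e_values_sketch(sample)
print(e_values)
```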

fast_conformation.msa_generation.parsers.parse_fasta(fasta_string: str) → Tuple[Sequence[str], Sequence[str]]

Parses a FASTA string and returns the amino-acid sequences together with their descriptions.

Parameters:

fasta_string – The string contents of a FASTA file.

Returns:

  • A list of sequences.

  • A list of sequence descriptions taken from the comment lines, in the same order as the sequences.

Return type:

Tuple[Sequence[str], Sequence[str]] (a tuple of two lists)
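The two parallel lists can be illustrated with the sketch below. It is a simplified stand-in, not the package's code: parse_fasta_sketch is a hypothetical name, and the handling of blank lines and wrapped sequence lines is an assumption.

```python
def parse_fasta_sketch(fasta_string):
    """Hypothetical stand-in: return (sequences, descriptions)."""
    sequences, descriptions = [], []
    for line in fasta_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            descriptions.append(line[1:])  # header text without the ">"
            sequences.append("")
        elif sequences:
            sequences[-1] += line  # join sequences wrapped over several lines
    return sequences, descriptions


fasta = ">query first chain\nMKTAY\nIAKQR\n>second\nGSHM\n"
sequences, descriptions = parse_fasta_sketch(fasta)
print(sequences, descriptions)
```

Because each header appends to both lists at once, the entry at position i in one list always corresponds to position i in the other.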

fast_conformation.msa_generation.parsers.parse_hhr(hhr_string: str) → Sequence[TemplateHit]

Parses the content of an entire HHR file.

fast_conformation.msa_generation.parsers.parse_stockholm(stockholm_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]

Parses sequences and deletion matrix from a Stockholm-format alignment.

Parameters:

stockholm_string – The string contents of a Stockholm file. The first sequence in the file should be the query sequence.

Returns:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.

  • The names of the targets matched, including the jackhmmer subsequence suffix.

Return type:

Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]

Module contents