fast_conformation.msa_generation package

Submodules

fast_conformation.msa_generation.colabfold module

fast_conformation.msa_generation.colabfold.chain_break(idx_res, Ls, length=200)

Adds a large number to residue indices to indicate chain breaks in a sequence.

Parameters:

idx_res (ndarray) – The array of residue indices.
Ls (list of int) – The lengths of different segments in the sequence.
length (int) – The value to add to the residue index at chain breaks.

Returns:

The updated array of residue indices with chain breaks.

Return type:

ndarray
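The offsetting described above can be illustrated with a minimal sketch. It is not the package's implementation: the name chain_break_sketch is hypothetical, plain Python lists stand in for the ndarray, and it assumes that every residue after a segment boundary is shifted by length, so offsets accumulate across successive breaks.

```python
def chain_break_sketch(idx_res, Ls, length=200):
    """Illustrative only: shift residue indices after each segment boundary."""
    shifted = list(idx_res)  # copy so the caller's data is left untouched
    boundary = 0
    # Every segment except the last one is followed by a chain break.
    for seg_len in Ls[:-1]:
        boundary += seg_len
        # Assumption: all residues past the boundary receive the offset.
        for i in range(boundary, len(shifted)):
            shifted[i] += length
    return shifted


# Two segments of 3 and 2 residues, numbered consecutively.
print(chain_break_sketch([0, 1, 2, 3, 4], [3, 2]))  # [0, 1, 2, 203, 204]
```

A gap in the index that is much larger than any real loop is a common way to signal to a structure predictor that consecutive positions are not covalently linked.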

fast_conformation.msa_generation.colabfold.get_hash(x)

Generates a SHA-1 hash for a given string.

Parameters:

x (str) – The input string to be hashed.

Returns:

The SHA-1 hash of the input string.

Return type:

str

fast_conformation.msa_generation.colabfold.homooligomerize(msas, deletion_matrices, homooligomer=1)

Homooligomerizes the input MSAs (Multiple Sequence Alignments) and deletion matrices.

Parameters:

msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple
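One way to picture homooligomerization of an MSA is block-diagonal padding: each copy of the chain gets its own block of rows, with gap characters filling the columns that belong to the other copies. The sketch below shows that scheme under stated assumptions. The name homooligomerize_sketch is hypothetical, and whether the function documented above uses exactly this layout is not confirmed by this page.

```python
def homooligomerize_sketch(msas, deletion_matrices, homooligomer=1):
    """Illustrative block-diagonal padding for `homooligomer` chain copies."""
    if homooligomer == 1:
        return msas, deletion_matrices  # nothing to do for a monomer
    new_msas, new_mtxs = [], []
    for copy in range(homooligomer):
        for msa, mtx in zip(msas, deletion_matrices):
            num_res = len(msa[0])
            left = num_res * copy                        # columns of earlier copies
            right = num_res * (homooligomer - copy - 1)  # columns of later copies
            new_msas.append(["-" * left + seq + "-" * right for seq in msa])
            new_mtxs.append([[0] * left + row + [0] * right for row in mtx])
    return new_msas, new_mtxs


msas = [["AC", "AG"]]
mtxs = [[[0, 0], [0, 1]]]
padded_msas, padded_mtxs = homooligomerize_sketch(msas, mtxs, homooligomer=2)
print(padded_msas)  # [['AC--', 'AG--'], ['--AC', '--AG']]
```

With this layout every row has the full oligomer length, so the padded MSAs can be concatenated without further alignment.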

fast_conformation.msa_generation.colabfold.homooligomerize_heterooligomer(msas, deletion_matrices, lengths, homooligomers)

Homooligomerizes the input MSAs and deletion matrices for heterooligomeric complexes.

Parameters:

msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
lengths (list of int) – A list of lengths for each component in the complex.
homooligomers (list of int) – A list of homooligomeric copies for each component.

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple

fast_conformation.msa_generation.colabfold.homooliomerize(msas, deletion_matrices, homooligomer=1)

Homooligomerizes the input MSAs and deletion matrices. This function is a misspelled alias of homooligomerize, kept for cross-compatibility.

Parameters:

msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).

Returns:

A tuple containing the homooligomerized MSAs and deletion matrices.

Return type:

tuple

fast_conformation.msa_generation.colabfold.plot_confidence(plddt, pae=None, Ls=None, dpi=100)

Plots predicted confidence metrics (pLDDT and PAE) for a protein structure.

Parameters:

plddt (ndarray) – Array of predicted Local Distance Difference Test (pLDDT) scores.
pae (ndarray) – Array of Predicted Aligned Error (PAE) scores (optional).
Ls (list of int) – The lengths of different segments in the sequence (optional).
dpi (int) – Dots per inch setting for the plot.

Returns:

The plot object displaying the confidence metrics.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_msas(msas, ori_seq=None, sort_by_seqid=True, deduplicate=True, dpi=100, return_plt=True)

Plots Multiple Sequence Alignments (MSAs).

Parameters:

msas (list of lists) – A list of MSAs to be plotted.
ori_seq (str) – The original sequence (optional).
sort_by_seqid (bool) – Whether to sort sequences by sequence identity (default: True).
deduplicate (bool) – Whether to remove duplicate sequences (default: True).
dpi (int) – Dots per inch setting for the plot.
return_plt (bool) – Whether to return the plot object (default: True).

Returns:

The plot object displaying the MSAs, if return_plt is True.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_plddt_legend(dpi=100)

Plots a legend for pLDDT (predicted Local Distance Difference Test) scores.

Parameters:

dpi (int) – Dots per inch setting for the plot.

Returns:

The plot object with the pLDDT legend.

Return type:

matplotlib.pyplot

fast_conformation.msa_generation.colabfold.plot_ticks(Ls)

Plots tick marks indicating segment boundaries on a plot.

Parameters:

Ls (list of int) – The lengths of different segments in the sequence.

fast_conformation.msa_generation.get_msa_jackhmmer module

fast_conformation.msa_generation.get_msa_jackhmmer.prep_inputs(sequence, jobname='test', homooligomer='1', output_dir=None, clean=False, verbose=True)

Prepares the input sequence and parameters for MSA generation.

Parameters:

sequence (str) – The protein sequence to be processed.
jobname (str) – The name of the job. Default is “test”.
homooligomer (str) – A string specifying the number of homooligomers for each sequence segment. Default is “1”.
output_dir (str) – The directory where output files will be saved. If None, a default directory is created based on the jobname and sequence hash.
clean (bool) – If True, cleans the output directory by removing existing files. Default is False.
verbose (bool) – If True, prints warnings and information during execution. Default is True.

Returns:

A dictionary containing the processed inputs, including sequences, homooligomer information, and output directory.

Return type:

dict

fast_conformation.msa_generation.get_msa_jackhmmer.prep_msa(I, msa_method='mmseqs2', add_custom_msa=False, msa_format='fas', pair_mode='unpaired', pair_cov=50, pair_qid=20, hhfilter_loc='hhfilter', reformat_loc='reformat.pl', TMP_DIR='tmp', custom_msa=None, precomputed=None, mmseqs_host_url='https://a3m.mmseqs.com', verbose=True, use_ramdisk=False)

Prepares and processes MSAs for the given sequences using the specified MSA method.

Parameters:

I (dict) – Dictionary containing input sequences and other parameters.
msa_method (str) – Method used to generate MSAs. Default is “mmseqs2”.
add_custom_msa (bool) – Whether to add a custom MSA. Default is False.
msa_format (str) – The format of the MSA file. Default is “fas”.
pair_mode (str) – Pairing mode for sequences. Can be “unpaired”, “paired”, or “unpaired+paired”. Default is “unpaired”.
pair_cov (int) – Coverage threshold for pairing sequences. Default is 50.
pair_qid (int) – Identity threshold for pairing sequences. Default is 20.
hhfilter_loc (str) – Path to the hhfilter binary. Default is “hhfilter”.
reformat_loc (str) – Path to the reformat.pl script. Default is “reformat.pl”.
TMP_DIR (str) – Path to the temporary directory. Default is “tmp”.
custom_msa (str) – Path to a custom MSA file (optional).
precomputed (str) – Path to a precomputed MSA file (optional).
mmseqs_host_url (str) – URL of the MMseqs2 server. Default is “https://a3m.mmseqs.com”.
verbose (bool) – If True, prints progress and information during execution. Default is True.
use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.

Returns:

The updated dictionary I containing the generated MSAs and deletion matrices.

Return type:

dict

fast_conformation.msa_generation.get_msa_jackhmmer.run_jackhmmer(sequence, prefix, jackhmmer_binary_path='jackhmmer', verbose=True, use_ramdisk=False)

Runs the jackhmmer tool to search for homologous sequences in a protein sequence database.

Parameters:

sequence (str) – The query protein sequence.
prefix (str) – The prefix for output files.
jackhmmer_binary_path (str) – Path to the jackhmmer binary executable. Default is ‘jackhmmer’.
verbose (bool) – If True, prints progress and information during execution. Default is True.
use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.

Returns:

A tuple containing the MSAs, deletion matrices, and sequence names.

Return type:

tuple

fast_conformation.msa_generation.jackhmmer module

Library to run Jackhmmer from Python.

class fast_conformation.msa_generation.jackhmmer.Jackhmmer(*, binary_path: str, database_path: str, use_ramdisk: bool = False, n_cpu: int = 8, n_iter: int = 1, e_value: float = 0.0001, z_value: int | None = None, get_tblout: bool = False, filter_f1: float = 5e-09, filter_f2: float = 5e-13, filter_f3: float = 5e-15, incdom_e: float | None = None, dom_e: float | None = None, num_streamed_chunks: int | None = None, streaming_callback: Callable[[int], None] | None = None)

Bases: object

Python wrapper of the Jackhmmer binary.

query(input_fasta_path: str) → Sequence[Mapping[str, Any]]

Queries the database using Jackhmmer.

fast_conformation.msa_generation.msa_utils module

Common utilities for data pipeline tools.

fast_conformation.msa_generation.msa_utils.create_directory(path)

fast_conformation.msa_generation.msa_utils.create_ram_disk()

fast_conformation.msa_generation.msa_utils.read_fasta(file_path)

fast_conformation.msa_generation.msa_utils.save_dict_to_fasta(seq_dict, output_path, jobname)

fast_conformation.msa_generation.msa_utils.timing(msg: str)

fast_conformation.msa_generation.msa_utils.tmpdir_manager(base_dir: str | None = None)

Context manager that deletes a temporary directory on exit.
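The pattern behind a context manager that deletes a temporary directory on exit can be sketched with the standard library alone. The name tmpdir_manager_sketch is hypothetical, and the use of tempfile.mkdtemp and shutil.rmtree is an assumption about one reasonable implementation, not a description of the package's code.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def tmpdir_manager_sketch(base_dir=None):
    """Yield a fresh temporary directory and remove it when the block exits."""
    tmpdir = tempfile.mkdtemp(dir=base_dir)
    try:
        yield tmpdir
    finally:
        # Runs on normal exit and on exceptions alike.
        shutil.rmtree(tmpdir, ignore_errors=True)


with tmpdir_manager_sketch() as tmp:
    scratch = os.path.join(tmp, "query.fasta")
    with open(scratch, "w") as handle:
        handle.write(">query\nMKV\n")
    existed_inside = os.path.exists(scratch)

exists_after = os.path.exists(tmp)
print(existed_inside, exists_after)  # True False
```

Putting the cleanup in a finally clause means intermediate files from an external tool do not accumulate when a run fails partway through.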

fast_conformation.msa_generation.pairmsa module

fast_conformation.msa_generation.pairmsa.get_uni_jackhmmer(msa, mtx, lab, filter_qid=0.15, filter_cov=0.5)

Filters sequences to retain only UniProt entries from a multiple sequence alignment (MSA).

Parameters:

msa (list of str) – List of sequences in the MSA.
mtx (list of list of int) – List of deletion matrices corresponding to the MSA.
lab (list of str) – List of labels corresponding to the MSA.
filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.
filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.

Returns:

Filtered (msa, mtx, lab) where each is a list.

Return type:

tuple

fast_conformation.msa_generation.pairmsa.hash_it(_seq, _lab, _mtx, call_uniprot=False)

Generates a hash for a given sequence and label.

Parameters:

_seq (list of str) – List of sequences.
_lab (list of str) – List of labels corresponding to the sequences.
_mtx (list of list of int) – List of deletion matrices corresponding to the sequences.
call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.

Returns:

A dictionary containing mappings of sequences, labels, and hashes.

Return type:

dict

fast_conformation.msa_generation.pairmsa.map_retrieve(ids, call_uniprot=False)

Maps UniRef IDs to UniProt accession numbers.

Parameters:

ids (list of str) – List of UniRef IDs.
call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.

Returns:

Mapping from UniRef IDs to UniProt accession numbers.

Return type:

dict

fast_conformation.msa_generation.pairmsa.parse_a3m(a3m_lines=None, a3m_file=None, filter_qid=0.15, filter_cov=0.5, N=100000)

Parses an A3M file or lines and filters sequences based on sequence identity and coverage.

Parameters:

a3m_lines (list of str) – Lines from an A3M file (optional).
a3m_file (str) – Path to an A3M file (optional).
filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.
filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.
N (int) – Maximum number of sequences to retain. Default is 100000.

Returns:

(sequences, deletion_matrices, names) where each is a list.

Return type:

tuple
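To make the two thresholds concrete, the sketch below shows one plausible reading of an identity and coverage filter. Everything in it is an assumption for illustration: the helper name passes_filters is hypothetical, identity is taken as the fraction of query positions with a matching residue, and coverage as the fraction of query positions that are not gaps in the hit. The function documented above may define either quantity differently.

```python
def passes_filters(query, hit, filter_qid=0.15, filter_cov=0.5):
    """Illustrative check of an aligned hit against identity/coverage cutoffs."""
    # Assumed definition: identical positions relative to the query length.
    identical = sum(q == h for q, h in zip(query, hit))
    # Assumed definition: non-gap positions relative to the query length.
    covered = sum(h != "-" for h in hit)
    return (identical >= filter_qid * len(query)
            and covered >= filter_cov * len(query))


query = "MKVLAAGG"
hits = ["MKVLAAGG", "MK------", "--------", "AAAAAAGG"]
kept = [h for h in hits if passes_filters(query, h)]
print(kept)  # ['MKVLAAGG', 'AAAAAAGG']
```

In this toy input the second hit is identical where it aligns but covers only a quarter of the query, so it fails the coverage cutoff even though it passes the identity cutoff.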

fast_conformation.msa_generation.pairmsa.stitch(_hash_a, _hash_b, stitch_min=1, stitch_max=20, filter_id=None)

Stitches two hashed sequences together based on their alignment.

Parameters:

_hash_a (dict) – First sequence hash information.
_hash_b (dict) – Second sequence hash information.
stitch_min (int) – Minimum allowed distance between aligned sequences. Default is 1.
stitch_max (int) – Maximum allowed distance between aligned sequences. Default is 20.
filter_id (None) – Placeholder for a potential filtering ID (not used).

Returns:

(sequences, deletion matrices) for the stitched sequences.

Return type:

tuple

fast_conformation.msa_generation.pairmsa.uni_num(ids)

Converts UniProt IDs to numerical representations.

Parameters:

ids (list of str) – List of UniProt IDs.

Returns:

Numerical representations of the UniProt IDs.

Return type:

list of int

fast_conformation.msa_generation.parsers module

Functions for parsing various file formats.

class fast_conformation.msa_generation.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])

Bases: object

Class representing a template hit.

aligned_cols: int

hit_sequence: str

index: int

indices_hit: List[int]

indices_query: List[int]

name: str

query: str

sum_probs: float

fast_conformation.msa_generation.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None) → str

Converts an MSA in Stockholm format to the A3M format.

fast_conformation.msa_generation.parsers.parse_a3m(a3m_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]]]

Parses sequences and the deletion matrix from an a3m-format alignment.

Parameters:

a3m_string – The string contents of an a3m file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.

Return type:

tuple
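The relationship between the aligned sequences and the deletion matrix can be seen in a small sketch. It assumes the usual a3m convention that lowercase letters mark residues with no counterpart in the query, so they are counted and then removed. The name parse_a3m_sketch is hypothetical and the code is an illustration, not the module's implementation.

```python
import string


def parse_a3m_sketch(a3m_string):
    """Illustrative a3m parser returning (aligned_sequences, deletion_matrix)."""
    sequences = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")      # a header line starts a new record
        elif sequences:
            sequences[-1] += line     # sequence data may span several lines

    deletion_matrix = []
    for seq in sequences:
        row, pending = [], 0
        for ch in seq:
            if ch.islower():
                pending += 1          # lowercase residue: count it
            else:
                row.append(pending)   # attach the count to the next column
                pending = 0
        deletion_matrix.append(row)

    # Dropping lowercase letters leaves rows of equal, query-length width.
    drop_lower = str.maketrans("", "", string.ascii_lowercase)
    aligned = [seq.translate(drop_lower) for seq in sequences]
    return aligned, deletion_matrix


aligned, deletions = parse_a3m_sketch(">query\nMKV\n>hit1\nMaaK-\n")
print(aligned)    # ['MKV', 'MK-']
print(deletions)  # [[0, 0, 0], [0, 2, 0]]
```

In the example, the two lowercase residues in hit1 are removed from the aligned row and recorded as a count of 2 at the column that follows them.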

fast_conformation.msa_generation.parsers.parse_e_values_from_tblout(tblout: str) → Dict[str, float]

Parses the target-to-e-value mapping from a Jackhmmer tblout string.
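A minimal sketch of such a mapping is given below, assuming the whitespace-separated per-target table that HMMER writes with --tblout, in which the first column is the target name and the fifth is the full-sequence E-value. The name parse_e_values_sketch and the sample rows are hypothetical, and the documented function may handle comments or the query entry differently.

```python
def parse_e_values_sketch(tblout):
    """Illustrative parser mapping target name -> full-sequence E-value."""
    e_values = {}
    for line in tblout.splitlines():
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Assumed layout: target, accession, query, accession, E-value, ...
        e_values[fields[0]] = float(fields[4])
    return e_values


sample = """# target name  accession  query name  accession  E-value  score  bias
hypo_target_1  -  query  -  1.2e-30  100.5  0.1
hypo_target_2  -  query  -  0.003  12.4  0.0
"""
e_values = parse_e_values_sketch(sample)
print(e_values)
```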

fast_conformation.msa_generation.parsers.parse_fasta(fasta_string: str) → Tuple[Sequence[str], Sequence[str]]

Parses a FASTA string and returns the amino-acid sequences together with their descriptions.

Parameters:

fasta_string – The string contents of a FASTA file.

Returns:

A tuple of two lists:
A list of sequences.
A list of sequence descriptions taken from the comment lines, in the same order as the sequences.

Return type:

tuple
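The two parallel lists described above can be produced with a short loop. The sketch below is a stand-alone illustration: the name parse_fasta_sketch is hypothetical, and details such as skipping blank lines and joining wrapped sequence lines are assumptions about reasonable behavior, not a statement about the module's code.

```python
def parse_fasta_sketch(fasta_string):
    """Illustrative FASTA parser returning (sequences, descriptions)."""
    sequences, descriptions = [], []
    for line in fasta_string.splitlines():
        line = line.strip()
        if not line:
            continue                         # ignore blank lines
        if line.startswith(">"):
            descriptions.append(line[1:])    # header without the '>'
            sequences.append("")
        elif sequences:
            sequences[-1] += line            # join wrapped sequence lines
    return sequences, descriptions


seqs, descs = parse_fasta_sketch(">first chain\nMKV\nLAA\n\n>second\nGG\n")
print(seqs)   # ['MKVLAA', 'GG']
print(descs)  # ['first chain', 'second']
```

Because both lists are appended to in lockstep, the description at index i always belongs to the sequence at index i.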

fast_conformation.msa_generation.parsers.parse_hhr(hhr_string: str) → Sequence[TemplateHit]

Parses the content of an entire HHR file.

fast_conformation.msa_generation.parsers.parse_stockholm(stockholm_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]

Parses sequences and deletion matrix from stockholm format alignment.

Parameters:

stockholm_string – The string contents of a stockholm file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
The names of the targets matched, including the jackhmmer subsequence suffix.

Return type:

tuple

Module contents