fast_conformation.msa_generation package
Submodules
fast_conformation.msa_generation.colabfold module
- fast_conformation.msa_generation.colabfold.chain_break(idx_res, Ls, length=200)
Adds a large number to residue indices to indicate chain breaks in a sequence.
- Parameters:
idx_res (ndarray) – The array of residue indices.
Ls (list of int) – The lengths of different segments in the sequence.
length (int) – The value to add to the residue index at chain breaks.
- Returns:
The updated array of residue indices with chain breaks.
- Return type:
ndarray
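The offsetting described above can be illustrated with a minimal sketch. It is not taken from the package source: the name `chain_break_sketch` is hypothetical, and the assumption is that every residue after a segment boundary is shifted by `length`, so the shifts accumulate across boundaries.

```python
import numpy as np

def chain_break_sketch(idx_res, Ls, length=200):
    """Hypothetical sketch: shift residue indices by `length` at each segment boundary."""
    idx_res = np.array(idx_res, copy=True)  # work on a copy, leave the input untouched
    boundary = 0
    for segment_length in Ls[:-1]:          # the last segment has no break after it
        boundary += segment_length
        idx_res[boundary:] += length        # everything after the boundary moves up
    return idx_res

# Two segments of lengths 2 and 3 over five consecutive residues.
shifted = chain_break_sketch(np.arange(5), Ls=[2, 3], length=200)
print(shifted.tolist())
```

The gap in the indices is what signals a chain break to downstream code that relies on residue-index differences.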
- fast_conformation.msa_generation.colabfold.get_hash(x)
Generate a SHA-1 hash for a given string.
- Parameters:
x (str) – The input string to be hashed.
- Returns:
The SHA-1 hash of the input string.
- Return type:
str
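For reference, a SHA-1 hex digest of a string can be computed with the standard library as sketched below. The helper name `sha1_hex` and the choice of UTF-8 encoding are assumptions for illustration; the page does not state how get_hash encodes its input.

```python
import hashlib

def sha1_hex(x: str) -> str:
    """Hypothetical helper: SHA-1 hex digest of a string (UTF-8 encoding assumed)."""
    return hashlib.sha1(x.encode("utf-8")).hexdigest()

digest = sha1_hex("abc")
print(digest)       # 40 hexadecimal characters
print(len(digest))
```

A digest like this gives a stable, filesystem-safe identifier for a sequence, which is useful for naming job directories.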
- fast_conformation.msa_generation.colabfold.homooligomerize(msas, deletion_matrices, homooligomer=1)
Homooligomerizes the input MSAs (Multiple Sequence Alignments) and deletion matrices.
- Parameters:
msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).
- Returns:
A tuple containing the homooligomerized MSAs and deletion matrices.
- Return type:
tuple
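One plausible way to homooligomerize an MSA is to place each copy in its own block of columns and pad the remaining columns with gaps. The sketch below illustrates that scheme only; it is an assumption, not the confirmed behavior of this function, and `homooligomerize_sketch` is a hypothetical name.

```python
def homooligomerize_sketch(msas, deletion_matrices, homooligomer=1):
    """Hypothetical sketch: one gap-padded copy of every MSA per oligomer copy."""
    if homooligomer == 1:
        return msas, deletion_matrices
    new_msas, new_mtxs = [], []
    for copy in range(homooligomer):
        for msa, mtx in zip(msas, deletion_matrices):
            n = len(msa[0])                       # monomer length
            left = n * copy                       # gap columns before this copy
            right = n * (homooligomer - copy - 1) # gap columns after this copy
            new_msas.append(["-" * left + s + "-" * right for s in msa])
            new_mtxs.append([[0] * left + list(row) + [0] * right for row in mtx])
    return new_msas, new_mtxs

msas = [["AC", "A-"]]
mtxs = [[[0, 0], [0, 1]]]
dimer_msas, dimer_mtxs = homooligomerize_sketch(msas, mtxs, homooligomer=2)
print(dimer_msas)
```

With this layout every output row has length `n * homooligomer`, and the deletion matrices stay aligned with the padded sequences.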
- fast_conformation.msa_generation.colabfold.homooligomerize_heterooligomer(msas, deletion_matrices, lengths, homooligomers)
Homooligomerizes the input MSAs and deletion matrices for heterooligomeric complexes.
- Parameters:
msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
lengths (list of int) – A list of lengths for each component in the complex.
homooligomers (list of int) – A list of homooligomeric copies for each component.
- Returns:
A tuple containing the homooligomerized MSAs and deletion matrices.
- Return type:
tuple
- fast_conformation.msa_generation.colabfold.homooliomerize(msas, deletion_matrices, homooligomer=1)
Homooligomerizes the input MSAs and deletion matrices. This is a misspelled variant of homooligomerize, provided for cross-compatibility.
- Parameters:
msas (list of lists) – A list of MSAs.
deletion_matrices (list of lists) – A list of deletion matrices corresponding to the MSAs.
homooligomer (int) – The number of homooligomeric copies. Default is 1 (no homooligomerization).
- Returns:
A tuple containing the homooligomerized MSAs and deletion matrices.
- Return type:
tuple
- fast_conformation.msa_generation.colabfold.plot_confidence(plddt, pae=None, Ls=None, dpi=100)
Plots predicted confidence metrics (pLDDT and PAE) for a protein structure.
- Parameters:
plddt (ndarray) – Array of predicted Local Distance Difference Test (pLDDT) scores.
pae (ndarray) – Array of Predicted Aligned Error (PAE) scores (optional).
Ls (list of int) – The lengths of different segments in the sequence (optional).
dpi (int) – Dots per inch setting for the plot.
- Returns:
The plot object displaying the confidence metrics.
- Return type:
matplotlib.pyplot
- fast_conformation.msa_generation.colabfold.plot_msas(msas, ori_seq=None, sort_by_seqid=True, deduplicate=True, dpi=100, return_plt=True)
Plots Multiple Sequence Alignments (MSAs).
- Parameters:
msas (list of lists) – A list of MSAs to be plotted.
ori_seq (str) – The original sequence (optional).
sort_by_seqid (bool) – Whether to sort sequences by sequence identity (default: True).
deduplicate (bool) – Whether to remove duplicate sequences (default: True).
dpi (int) – Dots per inch setting for the plot.
return_plt (bool) – Whether to return the plot object (default: True).
- Returns:
The plot object displaying the MSAs, if return_plt is True.
- Return type:
matplotlib.pyplot
fast_conformation.msa_generation.get_msa_jackhmmer module
- fast_conformation.msa_generation.get_msa_jackhmmer.prep_inputs(sequence, jobname='test', homooligomer='1', output_dir=None, clean=False, verbose=True)
Prepares the input sequence and parameters for MSA generation.
- Parameters:
sequence (str) – The protein sequence to be processed.
jobname (str) – The name of the job. Default is “test”.
homooligomer (str) – A string specifying the number of homooligomers for each sequence segment. Default is “1”.
output_dir (str) – The directory where output files will be saved. If None, a default directory is created based on the jobname and sequence hash.
clean (bool) – If True, cleans the output directory by removing existing files. Default is False.
verbose (bool) – If True, prints warnings and information during execution. Default is True.
- Returns:
A dictionary containing the processed inputs, including sequences, homooligomer information, and output directory.
- Return type:
dict
- fast_conformation.msa_generation.get_msa_jackhmmer.prep_msa(I, msa_method='mmseqs2', add_custom_msa=False, msa_format='fas', pair_mode='unpaired', pair_cov=50, pair_qid=20, hhfilter_loc='hhfilter', reformat_loc='reformat.pl', TMP_DIR='tmp', custom_msa=None, precomputed=None, mmseqs_host_url='https://a3m.mmseqs.com', verbose=True, use_ramdisk=False)
Prepares and processes MSAs for the given sequences using the specified MSA method.
- Parameters:
I (dict) – Dictionary containing input sequences and other parameters.
msa_method (str) – Method used to generate MSAs. Default is “mmseqs2”.
add_custom_msa (bool) – Whether to add a custom MSA. Default is False.
msa_format (str) – The format of the MSA file. Default is “fas”.
pair_mode (str) – Pairing mode for sequences. Can be “unpaired”, “paired”, or “unpaired+paired”. Default is “unpaired”.
pair_cov (int) – Coverage threshold for pairing sequences. Default is 50.
pair_qid (int) – Identity threshold for pairing sequences. Default is 20.
hhfilter_loc (str) – Path to the hhfilter binary. Default is “hhfilter”.
reformat_loc (str) – Path to the reformat.pl script. Default is “reformat.pl”.
TMP_DIR (str) – Path to the temporary directory. Default is “tmp”.
custom_msa (str) – Path to a custom MSA file (optional).
precomputed (str) – Path to a precomputed MSA file (optional).
mmseqs_host_url (str) – URL of the MMseqs2 server. Default is “https://a3m.mmseqs.com”.
verbose (bool) – If True, prints progress and information during execution. Default is True.
use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.
- Returns:
The updated dictionary I containing the generated MSAs and deletion matrices.
- Return type:
dict
- fast_conformation.msa_generation.get_msa_jackhmmer.run_jackhmmer(sequence, prefix, jackhmmer_binary_path='jackhmmer', verbose=True, use_ramdisk=False)
Runs the jackhmmer tool to search for homologous sequences in a protein sequence database.
- Parameters:
sequence (str) – The query protein sequence.
prefix (str) – The prefix for output files.
jackhmmer_binary_path (str) – Path to the jackhmmer binary executable. Default is ‘jackhmmer’.
verbose (bool) – If True, prints progress and information during execution. Default is True.
use_ramdisk (bool) – If True, uses a RAM disk for temporary storage. Default is False.
- Returns:
A tuple containing the MSAs, deletion matrices, and sequence names.
- Return type:
tuple
fast_conformation.msa_generation.jackhmmer module
Library to run Jackhmmer from Python.
- class fast_conformation.msa_generation.jackhmmer.Jackhmmer(*, binary_path: str, database_path: str, use_ramdisk: bool = False, n_cpu: int = 8, n_iter: int = 1, e_value: float = 0.0001, z_value: int | None = None, get_tblout: bool = False, filter_f1: float = 5e-09, filter_f2: float = 5e-13, filter_f3: float = 5e-15, incdom_e: float | None = None, dom_e: float | None = None, num_streamed_chunks: int | None = None, streaming_callback: Callable[[int], None] | None = None)
Bases: object
Python wrapper of the Jackhmmer binary.
fast_conformation.msa_generation.msa_utils module
Common utilities for data pipeline tools.
fast_conformation.msa_generation.pairmsa module
- fast_conformation.msa_generation.pairmsa.get_uni_jackhmmer(msa, mtx, lab, filter_qid=0.15, filter_cov=0.5)
Filters sequences to retain only UniProt entries from a multiple sequence alignment (MSA).
- Parameters:
msa (list of str) – List of sequences in the MSA.
mtx (list of list of int) – List of deletion matrices corresponding to the MSA.
lab (list of str) – List of labels corresponding to the MSA.
filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.
filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.
- Returns:
Filtered (msa, mtx, lab) where each is a list.
- Return type:
tuple
- fast_conformation.msa_generation.pairmsa.hash_it(_seq, _lab, _mtx, call_uniprot=False)
Generates a hash for a given sequence and label.
- Parameters:
_seq (list of str) – List of sequences.
_lab (list of str) – List of labels corresponding to the sequences.
_mtx (list of list of int) – List of deletion matrices corresponding to the sequences.
call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.
- Returns:
A dictionary containing mappings of sequences, labels, and hashes.
- Return type:
dict
- fast_conformation.msa_generation.pairmsa.map_retrieve(ids, call_uniprot=False)
Maps UniRef IDs to UniProt accession numbers.
- Parameters:
ids (list of str) – List of UniRef IDs.
call_uniprot (bool) – Whether to query UniProt for mapping information. Default is False.
- Returns:
Mapping from UniRef IDs to UniProt accession numbers.
- Return type:
dict
- fast_conformation.msa_generation.pairmsa.parse_a3m(a3m_lines=None, a3m_file=None, filter_qid=0.15, filter_cov=0.5, N=100000)
Parses an A3M file or lines and filters sequences based on sequence identity and coverage.
- Parameters:
a3m_lines (list of str) – Lines from an A3M file (optional).
a3m_file (str) – Path to an A3M file (optional).
filter_qid (float) – Minimum sequence identity to retain a sequence. Default is 0.15.
filter_cov (float) – Minimum coverage to retain a sequence. Default is 0.5.
N (int) – Maximum number of sequences to retain. Default is 100000.
- Returns:
(sequences, deletion_matrices, names) where each is a list.
- Return type:
tuple
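The identity and coverage filters can be pictured with the sketch below. The exact definitions used by the package are not given on this page, so the following are assumptions: identity is the fraction of query positions where the aligned residue matches the query, coverage is the fraction of query positions that are not gaps, and both thresholds are inclusive. The name `passes_filters` is hypothetical.

```python
def passes_filters(query, aligned, filter_qid=0.15, filter_cov=0.5):
    """Hypothetical sketch: keep `aligned` if identity and coverage meet the minimums.

    `aligned` must have the same length as `query` (insertions already removed).
    """
    non_gap = sum(1 for c in aligned if c != "-")
    matches = sum(1 for q, c in zip(query, aligned) if c != "-" and q == c)
    coverage = non_gap / len(query)   # fraction of query positions covered
    identity = matches / len(query)   # fraction of query positions matched
    return identity >= filter_qid and coverage >= filter_cov

query = "ACDEFGHIKL"
kept = [s for s in ["ACDEFG----", "ACD-------", "LLLLLLLLLL"]
        if passes_filters(query, s)]
print(kept)
```

In this example the second sequence fails on coverage (0.3) and the third fails on identity (0.1), so only the first is kept.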
- fast_conformation.msa_generation.pairmsa.stitch(_hash_a, _hash_b, stitch_min=1, stitch_max=20, filter_id=None)
Stitches two hashed sequences together based on their alignment.
- Parameters:
_hash_a (dict) – First sequence hash information.
_hash_b (dict) – Second sequence hash information.
stitch_min (int) – Minimum allowed distance between aligned sequences. Default is 1.
stitch_max (int) – Maximum allowed distance between aligned sequences. Default is 20.
filter_id (None) – Placeholder for a potential filtering ID (not used).
- Returns:
(sequences, deletion matrices) for the stitched sequences.
- Return type:
tuple
fast_conformation.msa_generation.parsers module
Functions for parsing various file formats.
- class fast_conformation.msa_generation.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])
Bases: object
Class representing a template hit.
- aligned_cols: int
- hit_sequence: str
- index: int
- indices_hit: List[int]
- indices_query: List[int]
- name: str
- query: str
- sum_probs: float
- fast_conformation.msa_generation.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None) -> str
Converts MSA in Stockholm format to the A3M format.
- fast_conformation.msa_generation.parsers.parse_a3m(a3m_string: str) -> Tuple[Sequence[str], Sequence[Sequence[int]]]
Parses sequences and deletion matrix from a3m format alignment.
- Parameters:
a3m_string – The string contents of an a3m file. The first sequence in the file should be the query sequence.
- Returns:
A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
- Return type:
tuple
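To make the deletion matrix concrete, here is a minimal sketch of a3m parsing. It relies on the a3m convention that lowercase letters mark insertions relative to the query; `parse_a3m_sketch` is a hypothetical name and the code is an illustration, not the package's implementation.

```python
def parse_a3m_sketch(a3m_string):
    """Hypothetical sketch: return (aligned sequences, deletion matrix) from a3m text."""
    raw = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            raw.append("")            # a header line starts a new record
        elif raw:
            raw[-1] += line           # sequence data may span several lines

    sequences, deletion_matrix = [], []
    for seq in raw:
        aligned, deletions, pending = [], [], 0
        for ch in seq:
            if ch.islower():          # lowercase = insertion, dropped from the alignment
                pending += 1
            else:                     # uppercase residue or gap keeps its column
                aligned.append(ch)
                deletions.append(pending)
                pending = 0
        sequences.append("".join(aligned))
        deletion_matrix.append(deletions)
    return sequences, deletion_matrix

seqs, dels = parse_a3m_sketch(">query\nACD\n>hit\nAxyC-\n")
print(seqs, dels)
```

Here the hit carries two inserted residues (`xy`) before its second aligned column, so `dels[1][1]` is 2 and every other entry is 0.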
- fast_conformation.msa_generation.parsers.parse_e_values_from_tblout(tblout: str) -> Dict[str, float]
Parses the target-to-e-value mapping from a Jackhmmer tblout string.
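A mapping of this kind could be built as sketched below. The sketch assumes the usual HMMER tblout layout, in which lines starting with `#` are comments, the first whitespace-separated field is the target name, and the fifth is the full-sequence E-value. The function name and the sample rows are hypothetical.

```python
def parse_e_values_sketch(tblout):
    """Hypothetical sketch: map target name -> full-sequence E-value from tblout text."""
    e_values = {}
    for line in tblout.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                           # skip blank and comment lines
        fields = line.split()
        e_values[fields[0]] = float(fields[4]) # assumed column order, see lead-in
    return e_values

# Made-up rows for illustration only.
sample = """\
# target name  accession  query name  accession  E-value  score  bias
sp|P12345|X    -          query       -          1.2e-30  110.5  0.1
sp|Q67890|Y    -          query       -          3.4e-05   25.0  0.0
"""
e_values = parse_e_values_sketch(sample)
print(e_values)
```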
- fast_conformation.msa_generation.parsers.parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]
Parses FASTA string and returns list of strings with amino-acid sequences.
- Parameters:
fasta_string – The string contents of a FASTA file.
- Returns:
A tuple of two lists:
A list of sequences.
A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
- Return type:
tuple
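The two parallel lists can be illustrated with a minimal FASTA parser. This is a sketch under the usual FASTA conventions (a `>` line starts a record, sequence data may wrap over several lines); `parse_fasta_sketch` is a hypothetical name, not the package's function.

```python
def parse_fasta_sketch(fasta_string):
    """Hypothetical sketch: return (sequences, descriptions) from FASTA text."""
    sequences, descriptions = [], []
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            descriptions.append(line[1:])  # description without the leading ">"
            sequences.append("")
        elif line and sequences:
            sequences[-1] += line          # join wrapped sequence lines
    return sequences, descriptions

fasta_seqs, fasta_descs = parse_fasta_sketch(">a first chain\nAC\nDE\n>b\nGG\n")
print(fasta_seqs, fasta_descs)
```

Because both lists are appended to in the same pass, `fasta_descs[i]` always describes `fasta_seqs[i]`.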
- fast_conformation.msa_generation.parsers.parse_hhr(hhr_string: str) -> Sequence[TemplateHit]
Parses the content of an entire HHR file.
- fast_conformation.msa_generation.parsers.parse_stockholm(stockholm_string: str) -> Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]
Parses sequences and deletion matrix from stockholm format alignment.
- Parameters:
stockholm_string – The string contents of a stockholm file. The first sequence in the file should be the query sequence.
- Returns:
A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
The names of the targets matched, including the jackhmmer subsequence suffix.
- Return type:
tuple