multimodal_fin.embeddings.builder package

Submodules

multimodal_fin.embeddings.builder.conference_encoder module

class multimodal_fin.embeddings.builder.conference_encoder.ConferenceEncoder(device='cpu', input_dim=512, hidden_dim=256, n_heads=4, d_output=512, max_nodes=1000, weights_path=None)[source]

Bases: Module

Encoder that aggregates node-level embeddings into a single conference-level embedding using a Transformer encoder with a [CLS] token and learned positional encodings.

forward(node_embeddings, return_attn=False)[source]
Parameters:
  • node_embeddings (Tensor) – Tensor of shape [n_nodes, input_dim]

  • return_attn (bool) – Whether to return the attention weights from the [CLS] token.

Return type:

Tuple[Tensor, Optional[Tensor]]

Returns:

Conference embedding of shape [1, d_output]. Optionally, the attention weights from the [CLS] token to all other nodes.
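The aggregation described above can be illustrated with a minimal sketch. It assumes PyTorch's stock nn.TransformerEncoder, a single layer, and an input projection to hidden_dim. The class name ClsPoolingEncoder and its attribute names are hypothetical, and the attention-weight output is omitted because the stock encoder does not expose it. This is not the package's implementation.

```python
import torch
import torch.nn as nn


class ClsPoolingEncoder(nn.Module):
    """Hypothetical stand-in for ConferenceEncoder (illustration only)."""

    def __init__(self, input_dim=512, hidden_dim=256, n_heads=4,
                 d_output=512, max_nodes=1000):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        # Learnable [CLS] token and learned positions (+1 slot for [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_emb = nn.Embedding(max_nodes + 1, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True
        )
        # Assumption: one layer; the real depth is not documented.
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.output_proj = nn.Linear(hidden_dim, d_output)

    def forward(self, node_embeddings):
        # [n_nodes, input_dim] -> [1, n_nodes, hidden_dim]
        x = self.input_proj(node_embeddings).unsqueeze(0)
        # Prepend [CLS]: [1, n_nodes + 1, hidden_dim]
        x = torch.cat([self.cls_token, x], dim=1)
        positions = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(positions)
        x = self.encoder(x)
        # The [CLS] position summarises the whole conference.
        return self.output_proj(x[:, 0])


torch.manual_seed(0)
model = ClsPoolingEncoder().eval()
with torch.no_grad():
    conference_embedding = model(torch.randn(12, 512))
print(conference_embedding.shape)  # torch.Size([1, 512])
```

Reading the output at the [CLS] position gives a fixed-size vector regardless of how many nodes the conference tree contains, up to max_nodes.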

multimodal_fin.embeddings.builder.feature_extractor module

class multimodal_fin.embeddings.builder.feature_extractor.FeatureExtractor(categories_10k=None, qa_categories=None, max_num_coherences=5)[source]

Bases: object

Extracts multimodal (text, audio, video) and metadata features from a conference tree node. Converts data into tensors suitable for model input.

extract(node)[source]

Extracts a multimodal tensor and metadata vector from a tree node.

Parameters:

node – A ConferenceNode object containing multimodal data and metadata.

Returns:

  • Tensor of shape [1, n, 21] with concatenated features.

  • mask: Boolean tensor of shape [1, n] indicating valid time steps.

  • meta_vec: Array of metadata of shape [expected_size].

Return type:

Tuple[Tensor, Tensor, ndarray]
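The shapes returned by extract can be illustrated as follows. The sketch assumes that the 21 features are three modalities (text, audio, video) of 7 values each, which is an inference from the [n_target, 7] shape documented for get_array_from_embedding, and that padded steps are zero rows. The helper padded and all values are made up for illustration.

```python
import numpy as np
import torch

n_valid, n_target = 3, 5  # 3 real time steps, padded to 5


def padded(seed):
    # Hypothetical per-modality array of shape [n_target, 7];
    # rows beyond n_valid are zero padding.
    arr = np.zeros((n_target, 7), dtype=np.float32)
    arr[:n_valid] = np.random.default_rng(seed).normal(size=(n_valid, 7))
    return arr


text, audio, video = padded(0), padded(1), padded(2)

# Concatenate along the feature axis, then add a batch dimension.
stacked = np.concatenate([text, audio, video], axis=1)  # [n, 21]
features = torch.from_numpy(stacked).unsqueeze(0)       # [1, n, 21]

# Boolean mask of shape [1, n]: True for valid steps, False for padding.
mask = (torch.arange(n_target) < n_valid).unsqueeze(0)

print(features.shape)  # torch.Size([1, 5, 21])
print(mask.tolist())   # [[True, True, True, False, False]]
```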

get_array_from_embedding(emb_data, n_target)[source]

Converts raw embeddings into a padded NumPy array of shape [n_target, 7].

Parameters:
  • emb_data (Union[List, dict]) – List or dict of raw embeddings.

  • n_target (int) – Desired number of time steps (padding/truncating applied).

Return type:

ndarray

Returns:

A NumPy array of shape [n_target, 7].
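The padding and truncation behaviour can be sketched as below. The function name pad_or_truncate is hypothetical, zero padding is an assumption, and only list input is handled because the layout of the dict form is not documented here.

```python
import numpy as np


def pad_or_truncate(rows, n_target, width=7):
    """Return a float32 array of shape [n_target, width].

    Illustration only: pads with zeros (an assumption) and keeps the
    first n_target rows when the input is longer.
    """
    out = np.zeros((n_target, width), dtype=np.float32)
    arr = np.asarray(rows, dtype=np.float32).reshape(-1, width)
    n = min(len(arr), n_target)
    out[:n] = arr[:n]
    return out


short = pad_or_truncate([[1.0] * 7, [2.0] * 7], n_target=4)
long = pad_or_truncate([[float(i)] * 7 for i in range(6)], n_target=4)
print(short.shape, long.shape)  # (4, 7) (4, 7)
```

In this sketch the two-row input gains two zero rows, and the six-row input keeps rows 0 through 3.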

safe_len(emb)[source]

Safely computes the length of embeddings regardless of structure.

Return type:

int

to_onehot(value, options)[source]

Converts a categorical value to a one-hot encoded vector.

Return type:

ndarray

to_onehot_bool(value)[source]

Encodes a boolean value as a one-hot vector, either [1, 0] or [0, 1].

Return type:

ndarray
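A minimal sketch of the two one-hot helpers follows. The function names, the all-zeros result for an unknown value, and the choice of [0, 1] for True are assumptions, since the documentation does not state which vector maps to which boolean. The category list is made up.

```python
import numpy as np


def onehot_sketch(value, options):
    # One slot per option; an unknown value yields all zeros (assumption).
    vec = np.zeros(len(options), dtype=np.float32)
    if value in options:
        vec[options.index(value)] = 1.0
    return vec


def onehot_bool_sketch(value):
    # Assumption: False -> [1, 0], True -> [0, 1].
    return np.array([0.0, 1.0] if value else [1.0, 0.0], dtype=np.float32)


categories = ["risk", "revenue", "outlook"]  # hypothetical categories
print(onehot_sketch("revenue", categories))  # [0. 1. 0.]
print(onehot_bool_sketch(True))              # [0. 1.]
```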

multimodal_fin.embeddings.builder.node_encoder module

class multimodal_fin.embeddings.builder.node_encoder.NodeEncoder(device='cpu', input_dim=21, hidden_dim=128, meta_dim=32, d_output=512, n_heads=4, categories_10k=None, qa_categories=None, weights_path='weights/node_encoder.pt')[source]

Bases: Module

Encodes individual nodes in a conference tree using multimodal features and metadata.

frase_encoder

Encoder for sentence-level features using attention.

Type:

nn.Module

meta_proj

Projection layer for metadata features.

Type:

nn.Linear

output_proj

Final projection layer to produce node embedding.

Type:

nn.Linear

categories_10k

List of 10-K classification categories.

Type:

List[str]

qa_categories

List of QA response categories.

Type:

List[str]

max_num_coherences

Maximum number of coherence entries per node.

Type:

int

multimodal_fin.embeddings.builder.pipeline module

class multimodal_fin.embeddings.builder.pipeline.ConferenceEmbeddingPipeline(node_encoder_params, conference_encoder_params, device='cpu')[source]

Bases: object

Orchestrates the generation and visualization of conference-level embeddings.

generate_embedding(json_path, return_attn=False)[source]

Generates the embedding for a given conference JSON.

Parameters:
  • json_path (str) – Path to the JSON file describing the conference.

  • return_attn (bool) – Whether to return attention weights.

Returns:

Embedding vector for the full conference.

Return type:

Tensor

visualize(plots=None)[source]

Visualizes the results of the embedding process depending on selected plots.

Parameters:

plots (dict) – Flags for which plots to generate.

multimodal_fin.embeddings.builder.sentence_attention_encoder module

class multimodal_fin.embeddings.builder.sentence_attention_encoder.SentenceAttentionEncoder(input_dim=21, hidden_dim=128, n_heads=4, dropout=0.1)[source]

Bases: Module

Encodes a sequence of token-level embeddings into a sentence-level embedding using self-attention. A learnable [CLS] token is prepended to attend over the sequence.

forward(x, mask=None, return_weights=False)[source]

Forward pass for the sentence encoder.

Parameters:
  • x (Tensor) – Input tensor of shape [B, N, input_dim], where B is batch size, N is sequence length.

  • mask (Optional[Tensor]) – Optional mask of shape [B, N] indicating valid tokens (1) vs padding (0).

  • return_weights (bool) – Whether to return the attention weights from the [CLS] token to the input tokens.

Returns:

  • If return_weights is False: tensor of shape [B, hidden_dim] representing sentence-level embeddings.

  • If return_weights is True: tuple (embeddings, attention_weights), where:
    • embeddings: [B, hidden_dim]

    • attention_weights: [B, N], the average attention from the [CLS] token to the input tokens

Return type:

Tuple[Tensor, Optional[Tensor]]
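The [CLS]-based pooling with a validity mask can be sketched as follows, assuming a single nn.MultiheadAttention layer and a linear input projection. The class name ClsAttentionPool and its attributes are hypothetical, and the sketch always returns both outputs rather than switching on return_weights. Note that nn.MultiheadAttention expects True at positions to ignore, so the documented mask (1 for valid tokens) is inverted.

```python
import torch
import torch.nn as nn


class ClsAttentionPool(nn.Module):
    """Hypothetical stand-in for SentenceAttentionEncoder (illustration only)."""

    def __init__(self, input_dim=21, hidden_dim=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.attn = nn.MultiheadAttention(
            hidden_dim, n_heads, dropout=dropout, batch_first=True
        )

    def forward(self, x, mask=None):
        batch = x.size(0)
        cls = self.cls_token.expand(batch, -1, -1)
        h = torch.cat([cls, self.proj(x)], dim=1)  # [B, N + 1, hidden_dim]
        key_padding_mask = None
        if mask is not None:
            # [CLS] is always valid; invert so that True means "ignore".
            cls_valid = torch.ones(batch, 1, dtype=torch.bool, device=x.device)
            key_padding_mask = ~torch.cat([cls_valid, mask.bool()], dim=1)
        # weights: [B, N + 1, N + 1], averaged over heads by default.
        out, weights = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        # Row 0 is the [CLS] query; drop its attention to itself.
        return out[:, 0], weights[:, 0, 1:]


torch.manual_seed(0)
encoder = ClsAttentionPool().eval()
x = torch.randn(2, 5, 21)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
with torch.no_grad():
    embeddings, attention_weights = encoder(x, mask)
print(embeddings.shape, attention_weights.shape)
```

In this sketch, padded positions receive zero attention weight because they are excluded from the softmax.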

multimodal_fin.embeddings.builder.transformer_encoder module

class multimodal_fin.embeddings.builder.transformer_encoder.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

Custom Transformer encoder layer with self-attention, feedforward network, residual connections and layer normalization.

forward(src, src_mask=None, src_key_padding_mask=None)[source]

Forward pass of the transformer encoder layer.

Parameters:
  • src (Tensor) – Input tensor of shape [B, T, d_model].

  • src_mask (Optional[Tensor]) – Optional attention mask [T, T] or [B * num_heads, T, T].

  • src_key_padding_mask (Optional[Tensor]) – Optional mask [B, T] indicating padding positions.

Return type:

Tensor

Returns:

Output tensor of shape [B, T, d_model].
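A layer with the components listed above (self-attention, feedforward network, residual connections, layer normalization) can be sketched as below. The post-norm ordering, the ReLU activation, and the class name EncoderLayerSketch are assumptions; the documentation does not say how the custom layer arranges these pieces.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Hypothetical post-norm encoder layer (illustration only)."""

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention block with residual connection, then LayerNorm.
        attn_out, _ = self.self_attn(
            src, src, src,
            attn_mask=src_mask,
            key_padding_mask=src_key_padding_mask,
        )
        src = self.norm1(src + self.dropout(attn_out))
        # Feedforward block with residual connection, then LayerNorm.
        src = self.norm2(src + self.dropout(self.ff(src)))
        return src


torch.manual_seed(0)
layer = EncoderLayerSketch(d_model=32, nhead=4, dim_feedforward=64).eval()
src = torch.randn(2, 6, 32)
# True marks padding positions, matching src_key_padding_mask above.
padding = torch.tensor([[False] * 4 + [True] * 2, [False] * 6])
with torch.no_grad():
    output = layer(src, src_key_padding_mask=padding)
print(output.shape)  # torch.Size([2, 6, 32])
```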

Module contents