multimodal_fin.embeddings.builder package

Submodules

multimodal_fin.embeddings.builder.conference_encoder module

class multimodal_fin.embeddings.builder.conference_encoder.ConferenceEncoder(device='cpu', input_dim=512, hidden_dim=256, n_heads=4, d_output=512, max_nodes=1000, weights_path=None)

Bases: Module

Encoder that aggregates node-level embeddings into a single conference-level embedding using a Transformer encoder with a [CLS] token and learned positional encodings.

forward(node_embeddings, return_attn=False)

Parameters:

node_embeddings (Tensor) – Tensor of shape [n_nodes, input_dim]
return_attn (bool) – Whether to return the attention weights from the [CLS] token.

Return type:

Tuple[Tensor, Optional[Tensor]]

Returns:

Conference embedding of shape [1, d_output] and, if return_attn is True, the attention weights from the [CLS] token to all other nodes.
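The aggregation described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the class name ClsAggregator, the input projection, the single encoder layer and the use of PyTorch's built-in nn.TransformerEncoderLayer are all assumptions, and the return_attn path is omitted.

```python
import torch
import torch.nn as nn


class ClsAggregator(nn.Module):
    """Hypothetical stand-in for a [CLS]-based conference encoder."""

    def __init__(self, input_dim=512, hidden_dim=256, n_heads=4,
                 d_output=512, max_nodes=1000):
        super().__init__()
        # Assumed: node embeddings are projected to the hidden width first.
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        # Learnable [CLS] token, prepended to the node sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        # Learned positional encodings; one extra slot for [CLS].
        self.pos_embedding = nn.Embedding(max_nodes + 1, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.output_proj = nn.Linear(hidden_dim, d_output)

    def forward(self, node_embeddings):
        # node_embeddings: [n_nodes, input_dim]
        x = self.input_proj(node_embeddings).unsqueeze(0)  # [1, n, hidden]
        x = torch.cat([self.cls_token, x], dim=1)          # [1, n + 1, hidden]
        positions = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embedding(positions)              # add positions
        x = self.encoder(x)
        # The output at the [CLS] position summarizes the whole conference.
        return self.output_proj(x[:, 0])                   # [1, d_output]
```

Calling ClsAggregator()(torch.randn(7, 512)) yields a tensor of shape [1, 512], matching the documented output shape.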

multimodal_fin.embeddings.builder.feature_extractor module

class multimodal_fin.embeddings.builder.feature_extractor.FeatureExtractor(categories_10k=None, qa_categories=None, max_num_coherences=5)

Bases: object

Extracts multimodal (text, audio, video) and metadata features from a conference tree node. Converts data into tensors suitable for model input.

extract(node)

Extracts a multimodal tensor and metadata vector from a tree node.

Parameters:

node – A ConferenceNode object containing multimodal data and metadata.

Returns:

A tuple of three items:
- Tensor of shape [1, n, 21] with concatenated features.
- mask: Boolean tensor of shape [1, n] indicating valid time steps.
- meta_vec: Array of metadata of shape [expected_size].

Return type:

Tuple[Tensor, Tensor, ndarray]
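One plausible reading of these shapes is that three per-modality arrays of width 7 are concatenated into 21 features per time step. The sketch below illustrates that reading with NumPy; the function name combine_modalities, the 3 x 7 layout and the n_valid argument are assumptions, not part of the documented API.

```python
import numpy as np


def combine_modalities(text, audio, video, n_valid):
    """Hypothetical helper: build a [1, n, 21] array and a [1, n] mask."""
    # Assumed layout: each modality contributes 7 features per time step.
    features = np.concatenate([text, audio, video], axis=-1)  # [n, 21]
    # True for real time steps, False for padded ones.
    mask = np.arange(features.shape[0]) < n_valid              # [n]
    # Add the leading batch dimension.
    return features[None, ...], mask[None, ...]


n = 4
features, mask = combine_modalities(
    np.ones((n, 7)), np.zeros((n, 7)), np.full((n, 7), 2.0), n_valid=3)
print(features.shape, mask.shape)
```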

get_array_from_embedding(emb_data, n_target)

Converts raw embeddings into a padded NumPy array of shape [n_target, 7].

Parameters:

emb_data (Union[List, dict]) – List or dict of raw embeddings.
n_target (int) – Desired number of time steps (padding/truncating applied).

Return type:

ndarray

Returns:

A NumPy array of shape [n_target, 7].
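The padding and truncation step can be illustrated with a short NumPy sketch. It assumes zero padding and list-like input only (the dict case is not handled here); the name pad_or_truncate is hypothetical.

```python
import numpy as np


def pad_or_truncate(rows, n_target, width=7):
    """Hypothetical helper: fit a list of rows into shape [n_target, width]."""
    out = np.zeros((n_target, width), dtype=np.float32)  # assumed zero padding
    arr = np.asarray(rows, dtype=np.float32).reshape(-1, width)
    n = min(len(arr), n_target)
    out[:n] = arr[:n]  # copy what fits; extra rows are dropped
    return out


short = pad_or_truncate([[1] * 7, [2] * 7], n_target=3)   # padded with one zero row
long = pad_or_truncate([[1] * 7] * 5, n_target=3)          # truncated to three rows
print(short.shape, long.shape)
```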

safe_len(emb)

Safely computes the length of embeddings regardless of structure.

Return type: int

to_onehot(value, options)

Converts a categorical value to a one-hot encoded vector.

Return type: ndarray

to_onehot_bool(value)

Encodes a boolean value as a one-hot vector, either [1, 0] or [0, 1].

Return type: ndarray
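The two one-hot helpers can be sketched as below. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: an unknown value maps to an all-zero vector, and True maps to [0, 1] while False maps to [1, 0].

```python
import numpy as np


def onehot(value, options):
    """Hypothetical sketch: one-hot encode `value` against a list of options."""
    vec = np.zeros(len(options), dtype=np.float32)
    if value in options:  # assumed: unknown values stay all-zero
        vec[options.index(value)] = 1.0
    return vec


def onehot_bool(value):
    """Hypothetical sketch: assumed mapping False -> [1, 0], True -> [0, 1]."""
    return np.array([0.0, 1.0] if value else [1.0, 0.0], dtype=np.float32)


print(onehot("b", ["a", "b", "c"]), onehot_bool(True))
```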

multimodal_fin.embeddings.builder.node_encoder module

class multimodal_fin.embeddings.builder.node_encoder.NodeEncoder(device='cpu', input_dim=21, hidden_dim=128, meta_dim=32, d_output=512, n_heads=4, categories_10k=None, qa_categories=None, weights_path='weights/node_encoder.pt')

Bases: Module

Encodes individual nodes in a conference tree using multimodal features and metadata.

frase_encoder

Encoder for sentence-level features using attention.

Type: nn.Module

meta_proj

Projection layer for metadata features.

Type: nn.Linear

output_proj

Final projection layer to produce node embedding.

Type: nn.Linear

categories_10k

List of 10-K classification categories.

Type: List[str]

qa_categories

List of QA response categories.

Type: List[str]

max_num_coherences

Maximum number of coherence entries per node.

Type: int
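To show how attributes such as meta_proj and output_proj could fit together, here is a minimal PyTorch sketch. Fusion by concatenation, the ReLU on the metadata branch, the metadata input width meta_in=16 and the class name NodeFusion are all assumptions; the actual NodeEncoder may combine these parts differently.

```python
import torch
import torch.nn as nn


class NodeFusion(nn.Module):
    """Hypothetical sketch: fuse a sentence embedding with metadata."""

    def __init__(self, hidden_dim=128, meta_in=16, meta_dim=32, d_output=512):
        super().__init__()
        # Projects the raw metadata vector to meta_dim features.
        self.meta_proj = nn.Linear(meta_in, meta_dim)
        # Maps the fused vector to the final node embedding.
        self.output_proj = nn.Linear(hidden_dim + meta_dim, d_output)

    def forward(self, sentence_emb, meta_vec):
        # sentence_emb: [B, hidden_dim], meta_vec: [B, meta_in]
        meta = torch.relu(self.meta_proj(meta_vec))
        fused = torch.cat([sentence_emb, meta], dim=-1)  # assumed concatenation
        return self.output_proj(fused)                   # [B, d_output]
```

With the default sizes, NodeFusion()(torch.randn(1, 128), torch.randn(1, 16)) returns a tensor of shape [1, 512].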

multimodal_fin.embeddings.builder.pipeline module

class multimodal_fin.embeddings.builder.pipeline.ConferenceEmbeddingPipeline(node_encoder_params, conference_encoder_params, device='cpu')

Bases: object

Orchestrates the generation and visualization of conference-level embeddings.

generate_embedding(json_path, return_attn=False)

Generates the embedding for a given conference JSON.

Parameters:

json_path (str) – Path to the JSON file describing the conference.
return_attn (bool) – Whether to return attention weights.

Returns:

Embedding vector for the full conference.

Return type:

Tensor

visualize(plots=None)

Visualizes the results of the embedding process depending on selected plots.

Parameters: plots (dict) – Flags selecting which plots to generate.

multimodal_fin.embeddings.builder.sentence_attention_encoder module

class multimodal_fin.embeddings.builder.sentence_attention_encoder.SentenceAttentionEncoder(input_dim=21, hidden_dim=128, n_heads=4, dropout=0.1)

Bases: Module

Encodes a sequence of token-level embeddings into a sentence-level embedding using self-attention. A learnable [CLS] token is prepended to attend over the sequence.

forward(x, mask=None, return_weights=False)

Forward pass for the sentence encoder.

Parameters:

x (Tensor) – Input tensor of shape [B, N, input_dim], where B is batch size, N is sequence length.
mask (Optional[Tensor]) – Optional mask of shape [B, N] indicating valid tokens (1) vs padding (0).
return_weights (bool) – Whether to return the attention weights from the [CLS] token to the input tokens.

Returns:

If return_weights is False: tensor of shape [B, hidden_dim] representing sentence-level embeddings.
If return_weights is True: tuple (embeddings, attention_weights), where:
- embeddings: [B, hidden_dim]
- attention_weights: [B, N] average attention from CLS to tokens

Return type:

Tuple[Tensor, Optional[Tensor]]
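The [CLS]-pooling pattern described for this encoder can be sketched with PyTorch's nn.MultiheadAttention. This is an illustration under assumptions (a single attention layer, an input projection, weights averaged over heads, and the hypothetical class name ClsSentencePooler), not the module's actual code.

```python
import torch
import torch.nn as nn


class ClsSentencePooler(nn.Module):
    """Hypothetical sketch: pool a sequence through a learnable [CLS] token."""

    def __init__(self, input_dim=21, hidden_dim=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.attn = nn.MultiheadAttention(
            hidden_dim, n_heads, dropout=dropout, batch_first=True)

    def forward(self, x, mask=None):
        batch = x.size(0)
        h = self.input_proj(x)                        # [B, N, hidden]
        cls = self.cls_token.expand(batch, -1, -1)    # [B, 1, hidden]
        h = torch.cat([cls, h], dim=1)                # [B, N + 1, hidden]
        key_padding_mask = None
        if mask is not None:
            # The documented mask marks valid tokens with 1, while
            # MultiheadAttention expects True at positions to ignore.
            cls_valid = torch.ones(batch, 1, dtype=torch.bool, device=x.device)
            key_padding_mask = ~torch.cat([cls_valid, mask.bool()], dim=1)
        out, weights = self.attn(
            h, h, h, key_padding_mask=key_padding_mask, need_weights=True)
        # weights: [B, N + 1, N + 1], averaged over heads by default.
        # Row 0 is the [CLS] query; drop its attention to itself.
        return out[:, 0], weights[:, 0, 1:]           # [B, hidden], [B, N]
```

Padded positions receive zero attention weight because they are excluded from the softmax through key_padding_mask.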

multimodal_fin.embeddings.builder.transformer_encoder module

class multimodal_fin.embeddings.builder.transformer_encoder.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)

Bases: Module

Custom Transformer encoder layer with self-attention, feedforward network, residual connections and layer normalization.

forward(src, src_mask=None, src_key_padding_mask=None)

Forward pass of the transformer encoder layer.

Parameters:

src (Tensor) – Input tensor of shape [B, T, d_model].
src_mask (Optional[Tensor]) – Optional attention mask [T, T] or [B * num_heads, T, T].
src_key_padding_mask (Optional[Tensor]) – Optional mask [B, T] indicating padding positions.

Return type:

Tensor

Returns:

Output tensor of shape [B, T, d_model].
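A layer with the components listed above (self-attention, feedforward network, residual connections, layer normalization) can be sketched as follows. The post-norm ordering, the ReLU activation and the class name EncoderLayerSketch are assumptions; the package's custom layer may differ in these details.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Hypothetical post-norm Transformer encoder layer."""

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention sublayer with residual connection, then LayerNorm.
        attn_out, _ = self.self_attn(
            src, src, src, attn_mask=src_mask,
            key_padding_mask=src_key_padding_mask, need_weights=False)
        src = self.norm1(src + self.dropout(attn_out))
        # Feedforward sublayer with residual connection, then LayerNorm.
        src = self.norm2(src + self.dropout(self.ffn(src)))
        return src  # [B, T, d_model]
```

The output keeps the input shape, so layers of this kind can be stacked directly.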
