multimodal_fin.embeddings.builder package
Submodules
multimodal_fin.embeddings.builder.conference_encoder module
- class multimodal_fin.embeddings.builder.conference_encoder.ConferenceEncoder(device='cpu', input_dim=512, hidden_dim=256, n_heads=4, d_output=512, max_nodes=1000, weights_path=None)[source]
Bases: Module
Encoder that aggregates node-level embeddings into a single conference-level embedding using a Transformer encoder with a [CLS] token and learned positional encodings.
- forward(node_embeddings, return_attn=False)[source]
- Parameters:
node_embeddings (Tensor) – Tensor of shape [n_nodes, input_dim].
return_attn (bool) – Whether to return the attention weights from the [CLS] token.
- Return type:
Tuple[Tensor, Optional[Tensor]]
- Returns:
Conference embedding of shape [1, d_output] and, optionally, the attention weights from [CLS] to all other nodes.
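The way a [CLS] token and learned positional encodings combine with the node embeddings can be sketched in NumPy. This is only an illustration under stated assumptions: assemble_sequence, cls_token and pos_table are hypothetical names, row 0 of the positional table is assumed to belong to [CLS], and the projections between input_dim, hidden_dim and d_output as well as the Transformer itself are omitted.

```python
import numpy as np

def assemble_sequence(node_embeddings, cls_token, pos_table):
    """Build the [n_nodes + 1, dim] sequence fed to a Transformer encoder.

    Hypothetical helper: cls_token and pos_table stand in for learned
    parameters, and row 0 of pos_table is reserved for [CLS] by assumption.
    """
    n_nodes = node_embeddings.shape[0]
    if n_nodes + 1 > pos_table.shape[0]:
        raise ValueError("more nodes than positional slots")
    # Prepend the [CLS] vector, then add one positional vector per slot.
    seq = np.vstack([cls_token[None, :], node_embeddings])
    return seq + pos_table[: n_nodes + 1]

rng = np.random.default_rng(0)
dim, max_nodes = 8, 10
nodes = rng.normal(size=(3, dim))
pos_table = rng.normal(size=(max_nodes + 1, dim))
seq = assemble_sequence(nodes, cls_token=np.zeros(dim), pos_table=pos_table)
# After encoding, position 0 would be read out as the conference embedding.
conference_slot = seq[:1]
print(seq.shape, conference_slot.shape)
```

Reading out only position 0 is what keeps the result at a fixed [1, dim] shape regardless of how many nodes the conference has.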
multimodal_fin.embeddings.builder.feature_extractor module
- class multimodal_fin.embeddings.builder.feature_extractor.FeatureExtractor(categories_10k=None, qa_categories=None, max_num_coherences=5)[source]
Bases: object
Extracts multimodal (text, audio, video) and metadata features from a conference tree node. Converts data into tensors suitable for model input.
- extract(node)[source]
Extracts a multimodal tensor and metadata vector from a tree node.
- Parameters:
node – A ConferenceNode object containing multimodal data and metadata.
- Returns:
Tensor of shape [1, n, 21] with concatenated features.
mask: Boolean tensor of shape [1, n] indicating valid time steps.
meta_vec: Array of metadata of shape [expected_size].
- Return type:
Tuple[Tensor,Tensor,ndarray]
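The shapes returned by extract can be illustrated with a small NumPy sketch. It assumes, without confirmation from the source, that the 21 features are three 7-dimensional blocks (one per modality), which would match the [n_target, 7] arrays produced by get_array_from_embedding. The name build_features is hypothetical, and the real method returns Tensors rather than arrays.

```python
import numpy as np

def build_features(text, audio, video, n_valid):
    """Concatenate three [n, 7] modality arrays into [1, n, 21] plus a [1, n] mask.

    Hypothetical helper: the real FeatureExtractor.extract may order or
    derive these blocks differently.
    """
    # Join the modalities along the feature axis: 7 + 7 + 7 = 21.
    features = np.concatenate([text, audio, video], axis=-1)
    n = features.shape[0]
    # True marks real time steps, False marks padding.
    mask = np.arange(n) < n_valid
    # Add a leading batch dimension of size 1 to both outputs.
    return features[None, ...], mask[None, ...]

n = 4
text = np.full((n, 7), 1.0)
audio = np.full((n, 7), 2.0)
video = np.full((n, 7), 3.0)
features, mask = build_features(text, audio, video, n_valid=3)
print(features.shape, mask.shape)
```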
- get_array_from_embedding(emb_data, n_target)[source]
Converts raw embeddings into a padded NumPy array of shape [n_target, 7].
- Parameters:
emb_data (Union[List, dict]) – List or dict of raw embeddings.
n_target (int) – Desired number of time steps (padding or truncation is applied).
- Return type:
ndarray
- Returns:
A NumPy array of shape [n_target, 7].
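A minimal sketch of padding or truncating to a fixed [n_target, 7] shape follows. The helper name pad_or_truncate is hypothetical, it only handles list input, and zero-padding is an assumption: the source states that padding/truncating is applied but not which fill value is used.

```python
import numpy as np

def pad_or_truncate(rows, n_target, width=7):
    """Return a float array of shape [n_target, width].

    Hypothetical helper: zero-padding is an assumption, the source only
    says that padding/truncating is applied.
    """
    out = np.zeros((n_target, width), dtype=np.float32)
    # Keep at most n_target rows; extra rows are dropped.
    rows = np.asarray(rows, dtype=np.float32).reshape(-1, width)[:n_target]
    # Copy the kept rows to the top; the remaining rows stay zero.
    out[: len(rows)] = rows
    return out

short = pad_or_truncate([[1.0] * 7, [2.0] * 7], n_target=4)
long = pad_or_truncate([[float(i)] * 7 for i in range(6)], n_target=4)
print(short.shape, long.shape)
```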
- safe_len(emb)[source]
Safely computes the length of embeddings regardless of structure.
- Return type:
int
multimodal_fin.embeddings.builder.node_encoder module
- class multimodal_fin.embeddings.builder.node_encoder.NodeEncoder(device='cpu', input_dim=21, hidden_dim=128, meta_dim=32, d_output=512, n_heads=4, categories_10k=None, qa_categories=None, weights_path='weights/node_encoder.pt')[source]
Bases: Module
Encodes individual nodes in a conference tree using multimodal features and metadata.
- frase_encoder
Encoder for sentence-level features using attention.
- Type:
nn.Module
- meta_proj
Projection layer for metadata features.
- Type:
nn.Linear
- output_proj
Final projection layer to produce node embedding.
- Type:
nn.Linear
- categories_10k
List of 10-K classification categories.
- Type:
List[str]
- qa_categories
List of QA response categories.
- Type:
List[str]
- max_num_coherences
Maximum number of coherence entries per node.
- Type:
int
multimodal_fin.embeddings.builder.pipeline module
- class multimodal_fin.embeddings.builder.pipeline.ConferenceEmbeddingPipeline(node_encoder_params, conference_encoder_params, device='cpu')[source]
Bases: object
Orchestrates the generation and visualization of conference-level embeddings.
- generate_embedding(json_path, return_attn=False)[source]
Generates the embedding for a given conference JSON.
- Parameters:
json_path (str) – Path to the JSON file describing the conference.
return_attn (bool) – Whether to return attention weights.
- Returns:
Embedding vector for the full conference.
- Return type:
Tensor
multimodal_fin.embeddings.builder.sentence_attention_encoder module
- class multimodal_fin.embeddings.builder.sentence_attention_encoder.SentenceAttentionEncoder(input_dim=21, hidden_dim=128, n_heads=4, dropout=0.1)[source]
Bases: Module
Encodes a sequence of token-level embeddings into a sentence-level embedding using self-attention. A learnable [CLS] token is prepended to attend over the sequence.
- forward(x, mask=None, return_weights=False)[source]
Forward pass for the sentence encoder.
- Parameters:
x (Tensor) – Input tensor of shape [B, N, input_dim], where B is the batch size and N is the sequence length.
mask (Optional[Tensor]) – Optional mask of shape [B, N] indicating valid tokens (1) vs. padding (0).
return_weights (bool) – Whether to return the attention weights from the [CLS] token to the input tokens.
- Returns:
If return_weights is False: tensor of shape [B, hidden_dim] representing sentence-level embeddings.
If return_weights is True: tuple (embeddings, attention_weights), where:
embeddings: [B, hidden_dim]
attention_weights: [B, N], the average attention from [CLS] to the tokens
- Return type:
Tuple[Tensor,Optional[Tensor]]
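The mask convention above (1 for valid tokens, 0 for padding) and the [B, N] attention output can be illustrated with a simplified NumPy sketch. It is a hypothetical stand-in, not the module's implementation: it uses a single head with no learned projections, treats the [CLS] vector only as a query instead of prepending it to the sequence, and therefore returns embeddings of size input_dim rather than hidden_dim.

```python
import numpy as np

def cls_attention_pool(x, mask, cls):
    """Pool [B, N, D] token features into [B, D] using one attention query.

    Simplified, hypothetical stand-in: a single head with no learned
    projections; the real encoder is multi-head and trainable.
    """
    d = x.shape[-1]
    # Scaled dot-product scores between the [CLS] query and every token: [B, N].
    scores = x @ cls / np.sqrt(d)
    # Padding positions (mask == 0) get -inf so softmax gives them zero weight.
    scores = np.where(mask.astype(bool), scores, -np.inf)
    # Numerically stable softmax over the token axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of token features: [B, N] x [B, N, D] -> [B, D].
    pooled = np.einsum("bn,bnd->bd", weights, x)
    return pooled, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 21))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
embeddings, attn = cls_attention_pool(x, mask, cls=rng.normal(size=21))
print(embeddings.shape, attn.shape)
```

Note that torch.nn.MultiheadAttention uses the opposite convention for key_padding_mask (True marks positions to ignore), so an implementation built on it would need to invert a valid-token mask before passing it through.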
multimodal_fin.embeddings.builder.transformer_encoder module
- class multimodal_fin.embeddings.builder.transformer_encoder.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]
Bases: Module
Custom Transformer encoder layer with self-attention, feedforward network, residual connections and layer normalization.
- forward(src, src_mask=None, src_key_padding_mask=None)[source]
Forward pass of the transformer encoder layer.
- Parameters:
src (Tensor) – Input tensor of shape [B, T, d_model].
src_mask (Optional[Tensor]) – Optional attention mask of shape [T, T] or [B * num_heads, T, T].
src_key_padding_mask (Optional[Tensor]) – Optional mask of shape [B, T] indicating padding positions.
- Return type:
Tensor
- Returns:
Output tensor of shape [B, T, d_model].
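The combination of self-attention, a feedforward network, residual connections and layer normalization can be sketched in NumPy as follows. This is an illustration under assumptions, not the module's code: it uses a single head without query/key/value projections, applies normalization after each residual sum (post-norm), leaves out dropout and the affine parameters of layer normalization, and the names encoder_layer, w1 and w2 are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(src, w1, w2, key_padding_mask=None):
    """src: [B, T, d_model]; key_padding_mask: [B, T], True marks padding."""
    d_model = src.shape[-1]
    # Pairwise attention scores between all positions: [B, T, T].
    scores = src @ src.transpose(0, 2, 1) / np.sqrt(d_model)
    if key_padding_mask is not None:
        # Padded keys get -inf so no query attends to them.
        scores = np.where(key_padding_mask[:, None, :], -np.inf, scores)
    attn_out = softmax(scores) @ src
    # First residual connection followed by layer normalization.
    x = layer_norm(src + attn_out)
    # Position-wise feedforward network with ReLU.
    ff = np.maximum(x @ w1, 0.0) @ w2
    # Second residual connection followed by layer normalization.
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
B, T, d_model, d_ff = 2, 4, 8, 16
src = rng.normal(size=(B, T, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
pad = np.array([[False, False, False, True], [False, False, False, False]])
out = encoder_layer(src, w1, w2, key_padding_mask=pad)
print(out.shape)
```

Because padded keys receive zero attention weight, changing the content of a padded position leaves the outputs at the valid positions of that sequence unchanged.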