multimodal_fin.processing.preprocessing package

Submodules

multimodal_fin.processing.preprocessing.ensemble_classifier module

class multimodal_fin.processing.preprocessing.ensemble_classifier.EnsembleInterventionClassifier(qa_model_names, monologue_model_names, NUM_EVALUATIONS=5, verbose=1)

Bases: object

Combines multiple Q&A and monologue classifiers to label interventions in a transcript. Handles classification and pairing of questions and answers.

NUM_EVALUATIONS: int = 5: Number of repeated evaluations per classifier for stability.

annotate_question_answer_pairs(df)

Assigns a unique ID to valid question-answer pairs in the transcript.

Parameters: df (DataFrame) – DataFrame already classified by classify_dataframe.
Return type: DataFrame
Returns: DataFrame with added ‘Pair’ (Q&A association) and ‘intervention_id’ columns.
Raises: ValueError – If any detected pair does not contain exactly two elements.
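
The pairing contract above can be illustrated with a small standalone sketch. It does not use the package or pandas: rows are plain dictionaries, and the helper name `annotate_pairs` is hypothetical. The pairing rule (a ‘Question’ row pairs with the row immediately after it when that row is an ‘Answer’) and the meaning of ‘intervention_id’ (a running row index) are assumptions, since the reference does not describe how pairs are detected.

```python
from typing import Dict, List


def annotate_pairs(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Hypothetical pairing rule: a 'Question' pairs with the next row if it is an 'Answer'."""
    # Assumption: 'intervention_id' is simply the row position.
    annotated = [dict(row, intervention_id=i, Pair=None) for i, row in enumerate(rows)]
    pairs: Dict[int, List[int]] = {}
    next_pair_id = 0
    for i in range(len(annotated) - 1):
        if (annotated[i]["classification"] == "Question"
                and annotated[i + 1]["classification"] == "Answer"):
            pairs[next_pair_id] = [i, i + 1]
            next_pair_id += 1
    for pair_id, members in pairs.items():
        # Mirror the documented contract: every pair has exactly two elements.
        if len(members) != 2:
            raise ValueError(f"pair {pair_id} has {len(members)} elements")
        for idx in members:
            annotated[idx]["Pair"] = pair_id
    return annotated


rows = [
    {"text": "Welcome to the call.", "classification": "Procedure"},
    {"text": "What drove margins?", "classification": "Question"},
    {"text": "Mainly pricing.", "classification": "Answer"},
    {"text": "Next question, please.", "classification": "Procedure"},
    {"text": "Any guidance update?", "classification": "Question"},
    {"text": "No change.", "classification": "Answer"},
]
result = annotate_pairs(rows)
pair_column = [row["Pair"] for row in result]
```

In this sketch, rows that belong to no pair keep `Pair` set to `None`; the real method may mark them differently.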

classify_dataframe(df)

Classifies each row in the transcript using an ensemble of classifiers.

Parameters: df (DataFrame) – DataFrame containing the transcript with a ‘Conf_Section’ column.
Returns: DataFrame with added ‘classification’, ‘global_confidence’, and ‘model_predictions’ columns.
Return type: DataFrame

ensemble_predict(text, classifiers)

Aggregates predictions from multiple classifiers for a given text.

Parameters:

text (str) – Text to classify.
classifiers (List) – List of classifiers (Q&A or monologue).

Return type:

Tuple[str, float, List[Tuple[str, str, float]]]

Returns:

Tuple with predicted category, average confidence, and individual model predictions.
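
As a rough illustration of the aggregation described above, the following standalone sketch combines per-model outputs by majority vote and averages the confidences of the models that voted for the winning label. The helper name `aggregate_predictions` and the voting rule are assumptions made for illustration; the actual `ensemble_predict` logic may weight or combine models differently.

```python
from collections import Counter
from typing import List, Tuple

# (model_name, predicted_label, confidence), matching the documented return shape.
Prediction = Tuple[str, str, float]


def aggregate_predictions(predictions: List[Prediction]) -> Tuple[str, float, List[Prediction]]:
    """Hypothetical aggregation: majority vote, then mean confidence of the winning votes."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    votes = Counter(label for _, label, _ in predictions)
    # most_common(1) yields the label with the most votes;
    # ties resolve to the label encountered first.
    winner, _ = votes.most_common(1)[0]
    winning_conf = [conf for _, label, conf in predictions if label == winner]
    avg_conf = sum(winning_conf) / len(winning_conf)
    return winner, avg_conf, predictions


preds = [
    ("model_a", "Question", 0.9),
    ("model_b", "Question", 0.7),
    ("model_c", "Answer", 0.6),
]
label, confidence, details = aggregate_predictions(preds)
```

Here two of the three models vote ‘Question’, so that label wins and the reported confidence is the mean of those two models' scores.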

monologue_model_names: List[str]: List of monologue classifier model names.

qa_model_names: List[str]: List of Q&A classifier model names.

verbose: int = 1: Verbosity level for logging.

multimodal_fin.processing.preprocessing.monologue_classifier module

class multimodal_fin.processing.preprocessing.monologue_classifier.CategoryPresentation(**data)

Bases: BaseModel

Pydantic schema for monologue classification output.

category: Literal['Monologue', 'Procedure']

model_config: ClassVar[ConfigDict] = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class multimodal_fin.processing.preprocessing.monologue_classifier.MonologueClassifier(model='llama3', NUM_EVALUATIONS=5, output_path='output')

Bases: UncertaintyMixin

Classifier for identifying monologue categories using an LLM and uncertainty estimation.

NUM_EVALUATIONS: int = 5

classify_dataframe(df)

Classifies all interventions in a DataFrame.

Parameters: df (DataFrame) – DataFrame containing a ‘text’ column with interventions.
Return type: DataFrame
Returns: DataFrame with an added ‘classification’ column.

classify_text(text)

Classifies a single text as either ‘Monologue’ or ‘Procedure’.

Parameters: text (str) – Input string representing a conference intervention.
Returns: ‘Monologue’ or ‘Procedure’.
Return type: str

get_pred(text)

Performs multiple evaluations and computes an uncertainty score.

Parameters: text (str) – Input intervention text.
Return type: Tuple[str, float]
Returns: A tuple of (predicted_category, confidence_score).
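
One plausible way to turn repeated evaluations into a (category, confidence) tuple is sketched below. It is a standalone illustration, not the package's UncertaintyMixin: the helper name `repeated_prediction` is hypothetical, and the confidence measure (the share of runs that agree with the most frequent label) is an assumption. A fixed sequence of outputs stands in for the stochastic LLM call.

```python
from collections import Counter
from typing import Callable, Tuple


def repeated_prediction(
    classify: Callable[[str], str], text: str, num_evaluations: int = 5
) -> Tuple[str, float]:
    """Hypothetical: sample `classify` several times and report the agreement rate."""
    if num_evaluations < 1:
        raise ValueError("num_evaluations must be at least 1")
    labels = [classify(text) for _ in range(num_evaluations)]
    # The most frequent label is the prediction; its share of runs is the confidence.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / num_evaluations


# Stand-in for a stochastic LLM call: replays a fixed sequence of outputs.
_outputs = iter(["Monologue", "Monologue", "Procedure", "Monologue", "Monologue"])
label, confidence = repeated_prediction(
    lambda _text: next(_outputs), "Thank you, operator.", num_evaluations=5
)
```

With four of the five runs agreeing, the sketch reports ‘Monologue’ with an agreement rate of 4/5.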

model: str = 'llama3'

output_path: str = 'output'

multimodal_fin.processing.preprocessing.preprocessor module

class multimodal_fin.processing.preprocessing.preprocessor.Preprocessor(qa_model_names, monologue_model_names, num_evaluations=5, verbose=1, section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')

Bases: object

Handles the full transcript preprocessing pipeline for a financial conference.

Steps:

1. Section segmentation between ‘prepared_remarks’ and ‘q_a’.
2. Classification using an ensemble of Q&A and monologue classifiers.
3. Annotation of question-answer pairs.

divide_conference(csv_path, json_path)

Assigns a section (‘prepared_remarks’ or ‘q_a’) to each row based on the location of the Q&A intro.

Parameters:

csv_path (str) – Path to transcript CSV.
json_path (str) – Path to LEVEL_4.json with Q&A intro.

Return type:

DataFrame

Returns:

DataFrame with new section column.

extract_qna_intro(json_path)

Extracts the first sentence of the Q&A section from the provided JSON.

Parameters: json_path (str) – Path to the LEVEL_4.json file.
Return type: str | None
Returns: First sentence of Q&A section or None if not found.

monologue_model_names: List[str]

num_evaluations: int = 5

process(csv_path, json_path)

Executes the sectioning, classification, and annotation pipeline.

Parameters:

csv_path (str) – Path to transcript CSV.
json_path (str) – Path to LEVEL_4.json.

Return type:

DataFrame

Returns:

Annotated and classified DataFrame.

process_and_save(csv_path, json_path, output_csv_path)

Runs the preprocessing pipeline and saves the final DataFrame to CSV.

Parameters:

csv_path (str) – Path to the input transcript CSV.
json_path (str) – Path to LEVEL_4.json.
output_csv_path (str) – Path to save the processed CSV.

Return type:

DataFrame

Returns:

Final processed DataFrame.

qa_model_names: List[str]

qna_key: str = 'questions_and_answers'

section_col: str = 'Conf_Section'

text_col: str = 'text'

verbose: int = 1

multimodal_fin.processing.preprocessing.qa_classifier module

class multimodal_fin.processing.preprocessing.qa_classifier.CategoryQA(**data)

Bases: BaseModel

Pydantic schema for Q&A classification output.

category: Literal['Question', 'Answer', 'Procedure']

model_config: ClassVar[ConfigDict] = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class multimodal_fin.processing.preprocessing.qa_classifier.QAClassifier(model='llama3', NUM_EVALUATIONS=5)

Bases: UncertaintyMixin

Classifier for identifying Q&A intervention types using an LLM with uncertainty estimation.

NUM_EVALUATIONS: int = 5: Number of times to sample the model for uncertainty estimation.

classify_dataframe(df)

Classifies all interventions in a DataFrame.

Parameters: df (DataFrame) – DataFrame containing a ‘text’ column.
Return type: DataFrame
Returns: DataFrame with an added ‘classification’ column.

classify_text(text)

Classifies a single intervention as ‘Question’, ‘Answer’ or ‘Procedure’.

Parameters: text (str) – The input intervention text.
Return type: str
Returns: A string category predicted by the LLM.

get_pred(text)

Performs multiple classifications and computes an uncertainty score.

Parameters: text (str) – Input intervention text.
Return type: Tuple[str, float]
Returns: A tuple of (predicted_category, confidence_score).

model: str = 'llama3': LLM model name.

multimodal_fin.processing.preprocessing.transcript_preprocessor module

class multimodal_fin.processing.preprocessing.transcript_preprocessor.TranscriptPreprocessor(section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')

Bases: object

Preprocesses a conference transcript by identifying the beginning of the Q&A section and labeling each row as either ‘prepared_remarks’ or ‘q_a’.

extract_qna_intro(json_path)

Extracts the first sentence of the Q&A section from the metadata JSON.

Parameters: json_path (str) – Path to the JSON metadata file.
Return type: Optional[str]
Returns: The first sentence of the Q&A intro, or None if not found or file is invalid.
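
The documented behaviour (first sentence on success, None when the key is missing or the file is invalid) can be sketched with the standard library alone. The helper name `first_qna_sentence` is hypothetical, and two things are assumed because the reference does not specify them: that the JSON maps `qna_key` directly to a plain string, and that a sentence ends at the first `.`, `!` or `?` followed by whitespace.

```python
import json
import os
import re
import tempfile
from typing import Optional


def first_qna_sentence(json_path: str, qna_key: str = "questions_and_answers") -> Optional[str]:
    """Hypothetical reader: assumes the JSON maps `qna_key` to a plain string."""
    try:
        with open(json_path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return None
    text = data.get(qna_key) if isinstance(data, dict) else None
    if not isinstance(text, str) or not text.strip():
        return None
    # Split once at the first sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


# Demonstration with a temporary metadata file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LEVEL_4.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"questions_and_answers": "We will now begin the Q&A session. First question, please."}, fh)
    intro = first_qna_sentence(path)
    missing = first_qna_sentence(os.path.join(tmp, "absent.json"))
```

Returning None for every failure mode keeps the caller's check to a single `is None` test, which matches the documented return type.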

preprocess(csv_path, json_path)

Labels each row in the transcript CSV as either ‘prepared_remarks’ or ‘q_a’ based on the location of the Q&A intro.

Parameters:

csv_path (str) – Path to the transcript CSV file.
json_path (str) – Path to the metadata JSON file.

Return type:

DataFrame

Returns:

A DataFrame with an added column (section_col) containing the section labels.
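
The labeling step can be pictured with a short standalone sketch that works on a list of row texts instead of a CSV. The helper name `label_sections` is hypothetical. The matching rule (the first row whose text contains the intro sentence starts the Q&A section) and the fallback of labeling everything ‘prepared_remarks’ when no intro is available are assumptions.

```python
from typing import List, Optional


def label_sections(texts: List[str], qna_intro: Optional[str]) -> List[str]:
    """Hypothetical rule: rows before the intro are 'prepared_remarks', the rest 'q_a'."""
    labels = []
    in_qna = False
    for text in texts:
        # Switch sections at the first row that contains the intro sentence.
        if not in_qna and qna_intro and qna_intro in text:
            in_qna = True
        labels.append("q_a" if in_qna else "prepared_remarks")
    return labels


texts = [
    "Good morning and welcome.",
    "Revenue grew 10% year over year.",
    "We will now begin the Q&A session. First question, please.",
    "What drove margins?",
]
labels = label_sections(texts, "We will now begin the Q&A session.")
no_intro_labels = label_sections(texts, None)
```

In the package, the resulting labels would be written to the column named by `section_col`.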

qna_key: str = 'questions_and_answers': Key used to extract the Q&A intro from the JSON metadata.

section_col: str = 'Conf_Section': Name of the column to write the section labels.

text_col: str = 'text': Column containing the transcript text.
