multimodal_fin.processing.preprocessing package

Submodules

multimodal_fin.processing.preprocessing.ensemble_classifier module

class multimodal_fin.processing.preprocessing.ensemble_classifier.EnsembleInterventionClassifier(qa_model_names, monologue_model_names, NUM_EVALUATIONS=5, verbose=1)[source]

Bases: object

Combines multiple Q&A and monologue classifiers to label interventions in a transcript. Handles classification and pairing of questions and answers.

NUM_EVALUATIONS: int = 5

Number of repeated evaluations per classifier for stability.

annotate_question_answer_pairs(df)[source]

Assigns a unique ID to valid question-answer pairs in the transcript.

Parameters:

df (DataFrame) – DataFrame already classified by classify_dataframe.

Return type:

DataFrame

Returns:

DataFrame with added ‘Pair’ and ‘intervention_id’ columns; ‘Pair’ holds the Q&A association.

Raises:

ValueError – If any detected pair does not contain exactly two elements.
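The pairing rule itself is not described here. The following is a minimal sketch only, assuming that each ‘Question’ row is paired with the next ‘Answer’ row that follows it and that rows outside any pair keep an empty ‘Pair’ value. The helper name annotate_pairs_sketch and the list-of-dicts row layout are hypothetical, chosen so the example runs without pandas; the real method operates on a DataFrame.

```python
from typing import Dict, List, Optional


def annotate_pairs_sketch(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return copies of `rows` with 'intervention_id' and 'Pair' fields added.

    Assumption: a question is paired with the first answer that follows it.
    """
    annotated: List[Dict[str, object]] = []
    pending_question: Optional[int] = None  # index of a question awaiting an answer
    next_pair_id = 0

    for idx, row in enumerate(rows):
        out: Dict[str, object] = dict(row, intervention_id=idx, Pair=None)
        label = row["classification"]
        if label == "Question":
            # A later question replaces an earlier unanswered one.
            pending_question = idx
        elif label == "Answer" and pending_question is not None:
            annotated[pending_question]["Pair"] = next_pair_id
            out["Pair"] = next_pair_id
            next_pair_id += 1
            pending_question = None
        annotated.append(out)

    # Mirror the documented ValueError: every pair must have exactly two rows.
    members: Dict[object, List[object]] = {}
    for out in annotated:
        if out["Pair"] is not None:
            members.setdefault(out["Pair"], []).append(out["intervention_id"])
    for pair_id, ids in members.items():
        if len(ids) != 2:
            raise ValueError(f"Pair {pair_id} has {len(ids)} elements, expected 2")

    return annotated


if __name__ == "__main__":
    transcript = [
        {"classification": "Procedure", "text": "Next question, please."},
        {"classification": "Question", "text": "How did margins develop?"},
        {"classification": "Answer", "text": "Margins improved slightly."},
    ]
    for item in annotate_pairs_sketch(transcript):
        print(item["intervention_id"], item["classification"], item["Pair"])
```

Under this assumption a question that is never answered, or an answer with no preceding question, is left unpaired rather than raising an error.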

classify_dataframe(df)[source]

Classifies each row in the transcript using an ensemble of classifiers.

Parameters:

df (DataFrame) – DataFrame containing the transcript with a ‘Conf_Section’ column.

Returns:

DataFrame with added ‘classification’, ‘global_confidence’, and ‘model_predictions’ columns.

Return type:

DataFrame

ensemble_predict(text, classifiers)[source]

Aggregates predictions from multiple classifiers for a given text.

Parameters:
  • text (str) – Text to classify.

  • classifiers (List) – List of classifiers (Q&A or monologue).

Return type:

Tuple[str, float, List[Tuple[str, str, float]]]

Returns:

Tuple with predicted category, average confidence, and individual model predictions.
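How the predictions are aggregated is not specified above. As a rough illustration, the sketch below assumes a majority vote over the per-model categories and a plain mean of all per-model confidences. The name ensemble_predict_sketch and the representation of each classifier as a (name, callable) pair are assumptions made for the example; only the shape of the returned tuple follows the documented return type.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (model_name, category, confidence), matching the documented return type
Prediction = Tuple[str, str, float]
# Hypothetical classifier interface: a name plus a callable returning (category, confidence)
NamedClassifier = Tuple[str, Callable[[str], Tuple[str, float]]]


def ensemble_predict_sketch(
    text: str, classifiers: List[NamedClassifier]
) -> Tuple[str, float, List[Prediction]]:
    """Majority vote over categories, mean confidence over all models (assumed rule)."""
    if not classifiers:
        raise ValueError("at least one classifier is required")

    individual: List[Prediction] = []
    for name, predict in classifiers:
        category, confidence = predict(text)
        individual.append((name, category, confidence))

    # Ties are resolved in favour of the category seen first.
    votes = Counter(category for _, category, _ in individual)
    winner = votes.most_common(1)[0][0]
    average_confidence = sum(conf for _, _, conf in individual) / len(individual)
    return winner, average_confidence, individual


if __name__ == "__main__":
    stubs: List[NamedClassifier] = [
        ("model_a", lambda _text: ("Question", 0.5)),
        ("model_b", lambda _text: ("Question", 1.0)),
        ("model_c", lambda _text: ("Answer", 0.75)),
    ]
    print(ensemble_predict_sketch("What is the outlook for next year?", stubs))
```

Other aggregation rules, such as confidence-weighted voting or averaging only the confidences of the winning category, would fit the same signature.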

monologue_model_names: List[str]

List of monologue classifier model names.

qa_model_names: List[str]

List of Q&A classifier model names.

verbose: int = 1

Verbosity level for logging.

multimodal_fin.processing.preprocessing.monologue_classifier module

class multimodal_fin.processing.preprocessing.monologue_classifier.CategoryPresentation(**data)[source]

Bases: BaseModel

Pydantic schema for monologue classification output.

category: Literal['Monologue', 'Procedure']
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class multimodal_fin.processing.preprocessing.monologue_classifier.MonologueClassifier(model='llama3', NUM_EVALUATIONS=5, output_path='output')[source]

Bases: UncertaintyMixin

Classifier for identifying monologue categories using an LLM and uncertainty estimation.

NUM_EVALUATIONS: int = 5
classify_dataframe(df)[source]

Classifies all interventions in a DataFrame.

Parameters:

df (DataFrame) – DataFrame containing a ‘text’ column with interventions.

Return type:

DataFrame

Returns:

DataFrame with an added ‘classification’ column.

classify_text(text)[source]

Classifies a single text as either ‘Monologue’ or ‘Procedure’.

Parameters:

text (str) – Input string representing a conference intervention.

Returns:

‘Monologue’ or ‘Procedure’.

Return type:

str

get_pred(text)[source]

Performs multiple evaluations and computes an uncertainty score.

Parameters:

text (str) – Input intervention text.

Return type:

Tuple[str, float]

Returns:

A tuple of (predicted_category, confidence_score).
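The scoring performed by UncertaintyMixin is not documented in this section. One common approach, shown here purely as an assumption, is to sample the classifier NUM_EVALUATIONS times, take the most frequent label, and report the fraction of samples that agree with it. The function get_pred_sketch and its classify argument are hypothetical stand-ins for the LLM call.

```python
from collections import Counter
from typing import Callable, Tuple


def get_pred_sketch(
    text: str, classify: Callable[[str], str], num_evaluations: int = 5
) -> Tuple[str, float]:
    """Sample `classify` repeatedly and return (majority_label, agreement_fraction)."""
    if num_evaluations < 1:
        raise ValueError("num_evaluations must be at least 1")

    labels = [classify(text) for _ in range(num_evaluations)]
    category, count = Counter(labels).most_common(1)[0]
    # Agreement fraction: 1.0 when every sample returned the same label.
    return category, count / num_evaluations


if __name__ == "__main__":
    # Deterministic stub standing in for a stochastic LLM classifier.
    samples = iter(["Monologue", "Procedure", "Monologue", "Monologue", "Monologue"])
    print(get_pred_sketch("Good morning and welcome to the call.", lambda _t: next(samples)))
```

With this rule a score near 1.0 signals a stable prediction, while a score close to the reciprocal of the number of categories signals that the model is effectively guessing.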

model: str = 'llama3'
output_path: str = 'output'

multimodal_fin.processing.preprocessing.preprocessor module

class multimodal_fin.processing.preprocessing.preprocessor.Preprocessor(qa_model_names, monologue_model_names, num_evaluations=5, verbose=1, section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')[source]

Bases: object

Handles the full transcript preprocessing pipeline for a financial conference:

Steps:
  1. Section segmentation between ‘prepared_remarks’ and ‘q_a’.

  2. Classification using an ensemble of Q&A and monologue classifiers.

  3. Annotation of question-answer pairs.

divide_conference(csv_path, json_path)[source]

Assigns a section (‘prepared_remarks’ or ‘q_a’) to each row based on the location of the Q&A intro.

Parameters:
  • csv_path (str) – Path to transcript CSV.

  • json_path (str) – Path to LEVEL_4.json with Q&A intro.

Return type:

DataFrame

Returns:

DataFrame with new section column.

extract_qna_intro(json_path)[source]

Extracts the first sentence of the Q&A section from the provided JSON.

Parameters:

json_path (str) – Path to the LEVEL_4.json file.

Return type:

str | None

Returns:

First sentence of Q&A section or None if not found.

monologue_model_names: List[str]
num_evaluations: int = 5
process(csv_path, json_path)[source]

Executes sectioning, classification, and annotation pipeline.

Parameters:
  • csv_path (str) – Path to transcript CSV.

  • json_path (str) – Path to LEVEL_4.json.

Return type:

DataFrame

Returns:

Annotated and classified DataFrame.

process_and_save(csv_path, json_path, output_csv_path)[source]

Runs the preprocessing pipeline and saves the final DataFrame to CSV.

Parameters:
  • csv_path (str) – Input transcript CSV.

  • json_path (str) – LEVEL_4.json.

  • output_csv_path (str) – Path to save the processed CSV.

Return type:

DataFrame

Returns:

Final processed DataFrame.

qa_model_names: List[str]
qna_key: str = 'questions_and_answers'
section_col: str = 'Conf_Section'
text_col: str = 'text'
verbose: int = 1

multimodal_fin.processing.preprocessing.qa_classifier module

class multimodal_fin.processing.preprocessing.qa_classifier.CategoryQA(**data)[source]

Bases: BaseModel

Pydantic schema for Q&A classification output.

category: Literal['Question', 'Answer', 'Procedure']
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class multimodal_fin.processing.preprocessing.qa_classifier.QAClassifier(model='llama3', NUM_EVALUATIONS=5)[source]

Bases: UncertaintyMixin

Classifier for identifying Q&A intervention types using an LLM with uncertainty estimation.

NUM_EVALUATIONS: int = 5

Number of times to sample the model for uncertainty estimation.

classify_dataframe(df)[source]

Classifies all interventions in a DataFrame.

Parameters:

df (DataFrame) – DataFrame containing a ‘text’ column.

Return type:

DataFrame

Returns:

DataFrame with an added ‘classification’ column.

classify_text(text)[source]

Classifies a single intervention as ‘Question’, ‘Answer’ or ‘Procedure’.

Parameters:

text (str) – The input intervention text.

Return type:

str

Returns:

A string category predicted by the LLM.

get_pred(text)[source]

Performs multiple classifications and computes an uncertainty score.

Parameters:

text (str) – Input intervention text.

Return type:

Tuple[str, float]

Returns:

A tuple of (predicted_category, confidence_score).

model: str = 'llama3'

LLM model name.

multimodal_fin.processing.preprocessing.transcript_preprocessor module

class multimodal_fin.processing.preprocessing.transcript_preprocessor.TranscriptPreprocessor(section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')[source]

Bases: object

Preprocesses a conference transcript by identifying the beginning of the Q&A section and labeling each row as either ‘prepared_remarks’ or ‘q_a’.

extract_qna_intro(json_path)[source]

Extracts the first sentence of the Q&A section from the metadata JSON.

Parameters:

json_path (str) – Path to the JSON metadata file.

Return type:

Optional[str]

Returns:

The first sentence of the Q&A intro, or None if not found or file is invalid.
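The layout of the metadata JSON is not shown in this reference. The sketch below assumes the value stored under qna_key is a plain string holding the Q&A section text, and that a sentence ends at the first ‘.’, ‘!’ or ‘?’ followed by whitespace or the end of the text. Both helper names are hypothetical.

```python
import json
import re
from typing import Optional


def first_sentence(text: str) -> Optional[str]:
    """Return the first sentence of `text`, or None if `text` is blank."""
    stripped = text.strip()
    if not stripped:
        return None
    # Shortest prefix ending in sentence punctuation followed by whitespace or the end.
    match = re.match(r".+?[.!?](?=\s|$)", stripped, flags=re.DOTALL)
    return match.group(0) if match else stripped


def extract_qna_intro_sketch(
    json_path: str, qna_key: str = "questions_and_answers"
) -> Optional[str]:
    """Load the metadata file and return the first sentence stored under `qna_key`."""
    try:
        with open(json_path, encoding="utf-8") as handle:
            metadata = json.load(handle)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: report "not found".
        return None

    section = metadata.get(qna_key) if isinstance(metadata, dict) else None
    if not isinstance(section, str):
        return None
    return first_sentence(section)


if __name__ == "__main__":
    print(first_sentence("Thank you. We will now begin the question-and-answer session."))
```

Requiring whitespace after the punctuation keeps decimal figures such as "3.5 million" from being split mid-sentence.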

preprocess(csv_path, json_path)[source]

Labels each row in the transcript CSV as either ‘prepared_remarks’ or ‘q_a’ based on the location of the Q&A intro.

Parameters:
  • csv_path (str) – Path to the transcript CSV file.

  • json_path (str) – Path to the metadata JSON file.

Return type:

DataFrame

Returns:

A DataFrame with an added column (section_col) containing the section labels.
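The labelling step can be pictured as a single pass over the transcript rows. This is a sketch under stated assumptions, not the package's implementation: it assumes the row whose text contains the Q&A intro is the first ‘q_a’ row, and that every row is labelled ‘prepared_remarks’ when no intro is available. Rows are plain dicts so the example runs without pandas, and label_sections_sketch is a hypothetical name.

```python
from typing import Dict, List, Optional


def label_sections_sketch(
    rows: List[Dict[str, str]],
    qna_intro: Optional[str],
    text_col: str = "text",
    section_col: str = "Conf_Section",
) -> List[Dict[str, str]]:
    """Return copies of `rows` with `section_col` set to 'prepared_remarks' or 'q_a'."""
    labelled: List[Dict[str, str]] = []
    in_qna = False
    for row in rows:
        # Switch to 'q_a' at the first row that contains the intro sentence.
        if not in_qna and qna_intro and qna_intro in row[text_col]:
            in_qna = True
        section = "q_a" if in_qna else "prepared_remarks"
        labelled.append({**row, section_col: section})
    return labelled


if __name__ == "__main__":
    transcript = [
        {"text": "Welcome to the third-quarter earnings call."},
        {"text": "We will now open the line for questions. First caller, go ahead."},
        {"text": "Thanks. Could you comment on guidance?"},
    ]
    for item in label_sections_sketch(transcript, "We will now open the line for questions."):
        print(item["Conf_Section"], "|", item["text"])
```

An exact substring match is the simplest choice; a real transcript may need normalisation of whitespace or punctuation before the comparison.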

qna_key: str = 'questions_and_answers'

Key used to extract the Q&A intro from the JSON metadata.

section_col: str = 'Conf_Section'

Name of the column to write the section labels.

text_col: str = 'text'

Column containing the transcript text.

Module contents