multimodal_fin.processing.preprocessing package
Submodules
multimodal_fin.processing.preprocessing.ensemble_classifier module
- class multimodal_fin.processing.preprocessing.ensemble_classifier.EnsembleInterventionClassifier(qa_model_names, monologue_model_names, NUM_EVALUATIONS=5, verbose=1)[source]
Bases: object
Combines multiple Q&A and monologue classifiers to label interventions in a transcript. Handles classification and pairing of questions and answers.
- NUM_EVALUATIONS: int = 5
Number of repeated evaluations per classifier for stability.
- annotate_question_answer_pairs(df)[source]
Assigns a unique ID to valid question-answer pairs in the transcript.
- Parameters:
df (DataFrame) – DataFrame already classified by classify_dataframe.
- Return type:
DataFrame
- Returns:
DataFrame with an added ‘Pair’ column for Q&A associations and ‘intervention_id’.
- Raises:
ValueError – If any detected pair does not contain exactly two elements.
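The pairing rule itself is not spelled out here. The following is a minimal sketch, assuming each ‘Question’ is paired with the first ‘Answer’ that follows it and that pair IDs are sequential integers. The function name pair_questions_with_answers is hypothetical, and the sketch works on a plain list of labels rather than a DataFrame so that it stays self-contained.

```python
from typing import List, Optional


def pair_questions_with_answers(labels: List[str]) -> List[Optional[int]]:
    """Return one pair ID per row, or None for rows outside any pair.

    Assumed rule (not confirmed by the source): a 'Question' is paired
    with the first 'Answer' that follows it. A later 'Question' that
    arrives before any answer replaces the pending one.
    """
    pair_ids: List[Optional[int]] = [None] * len(labels)
    pending_question: Optional[int] = None  # index of the unanswered question
    next_id = 0

    for i, label in enumerate(labels):
        if label == "Question":
            pending_question = i
        elif label == "Answer" and pending_question is not None:
            # Close the pair: both rows receive the same ID.
            pair_ids[pending_question] = next_id
            pair_ids[i] = next_id
            next_id += 1
            pending_question = None
    return pair_ids


labels = ["Procedure", "Question", "Answer", "Answer",
          "Question", "Procedure", "Answer"]
pair_ids = pair_questions_with_answers(labels)
```

Under this rule every assigned ID covers exactly two rows, which matches the two-element condition that the documented ValueError guards.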
- classify_dataframe(df)[source]
Classifies each row in the transcript using an ensemble of classifiers.
- Parameters:
df (DataFrame) – DataFrame containing the transcript with a ‘Conf_Section’ column.
- Returns:
DataFrame with the added columns ‘classification’, ‘global_confidence’, and ‘model_predictions’.
- Return type:
DataFrame
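How the ‘Conf_Section’ column is used is not stated above. One plausible reading, sketched below as an assumption, is that rows labelled ‘q_a’ are sent to the Q&A classifiers and all other rows to the monologue classifiers. The function classify_rows and the stub classifiers are hypothetical, and rows are plain dictionaries instead of DataFrame rows.

```python
from typing import Callable, Dict, List


def classify_rows(
    rows: List[Dict[str, str]],
    classify_qa: Callable[[str], str],
    classify_monologue: Callable[[str], str],
    section_col: str = "Conf_Section",
    text_col: str = "text",
) -> List[Dict[str, str]]:
    """Label each row with the classifier matching its section.

    Assumption: 'q_a' rows go to the Q&A classifier, every other row
    to the monologue classifier.
    """
    labelled = []
    for row in rows:
        classify = classify_qa if row[section_col] == "q_a" else classify_monologue
        labelled.append({**row, "classification": classify(row[text_col])})
    return labelled


# Stub classifiers standing in for the LLM-backed ones.
def stub_qa(text: str) -> str:
    return "Question" if text.rstrip().endswith("?") else "Answer"


def stub_monologue(text: str) -> str:
    return "Monologue"


rows = [
    {"Conf_Section": "prepared_remarks", "text": "Revenue grew this quarter."},
    {"Conf_Section": "q_a", "text": "What drove the margin change?"},
    {"Conf_Section": "q_a", "text": "Mostly lower input costs."},
]
labelled_rows = classify_rows(rows, stub_qa, stub_monologue)
```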
- ensemble_predict(text, classifiers)[source]
Aggregates predictions from multiple classifiers for a given text.
- Parameters:
text (str) – Text to classify.
classifiers (List) – List of classifiers (Q&A or monologue).
- Return type:
Tuple[str, float, List[Tuple[str, str, float]]]
- Returns:
Tuple with predicted category, average confidence, and individual model predictions.
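The return shape above can be illustrated with a small sketch. It assumes the predicted category is chosen by majority vote and the confidence is the mean over all models. The vote rule, the name ensemble_predict_sketch, and the (name, callable) representation of a classifier are assumptions, not the library's confirmed behavior.

```python
from collections import Counter
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (category, confidence)
NamedClassifier = Tuple[str, Callable[[str], Prediction]]


def ensemble_predict_sketch(
    text: str, classifiers: List[NamedClassifier]
) -> Tuple[str, float, List[Tuple[str, str, float]]]:
    """Aggregate per-model predictions into one label and one confidence."""
    if not classifiers:
        raise ValueError("at least one classifier is required")

    # Collect (model_name, category, confidence) for every model.
    individual = []
    for name, predict in classifiers:
        category, confidence = predict(text)
        individual.append((name, category, confidence))

    # Assumed aggregation: majority vote for the label,
    # plain mean of all confidences for the score.
    votes = Counter(category for _, category, _ in individual)
    winner = votes.most_common(1)[0][0]
    mean_confidence = sum(c for _, _, c in individual) / len(individual)
    return winner, mean_confidence, individual


# Stub models with fixed outputs, standing in for LLM-backed classifiers.
stub_models: List[NamedClassifier] = [
    ("model_a", lambda _t: ("Question", 1.0)),
    ("model_b", lambda _t: ("Answer", 0.5)),
    ("model_c", lambda _t: ("Question", 0.75)),
]
label, confidence, details = ensemble_predict_sketch(
    "Could you comment on guidance?", stub_models
)
```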
- monologue_model_names: List[str]
List of monologue classifier model names.
- qa_model_names: List[str]
List of Q&A classifier model names.
- verbose: int = 1
Verbosity level for logging.
multimodal_fin.processing.preprocessing.monologue_classifier module
- class multimodal_fin.processing.preprocessing.monologue_classifier.CategoryPresentation(**data)[source]
Bases: BaseModel
Pydantic schema for monologue classification output.
- category: Literal['Monologue', 'Procedure']
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class multimodal_fin.processing.preprocessing.monologue_classifier.MonologueClassifier(model='llama3', NUM_EVALUATIONS=5, output_path='output')[source]
Bases: UncertaintyMixin
Classifier for identifying monologue categories using an LLM and uncertainty estimation.
- NUM_EVALUATIONS: int = 5
- classify_dataframe(df)[source]
Classifies all interventions in a DataFrame.
- Parameters:
df (DataFrame) – DataFrame containing a ‘text’ column with interventions.
- Return type:
DataFrame
- Returns:
DataFrame with an added ‘classification’ column.
- classify_text(text)[source]
Classifies a single text as either ‘Monologue’ or ‘Procedure’.
- Parameters:
text (str) – Input string representing a conference intervention.
- Returns:
‘Monologue’ or ‘Procedure’.
- Return type:
str
- get_pred(text)[source]
Performs multiple evaluations and computes uncertainty score.
- Parameters:
text (str) – Input intervention text.
- Return type:
Tuple[str, float]
- Returns:
A tuple of (predicted_category, confidence_score).
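The exact uncertainty measure provided by UncertaintyMixin is not described here. As a stand-in, the sketch below samples a classifier NUM_EVALUATIONS times and reports the share of samples that agree with the most frequent label. The names get_pred_sketch and classify_once are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def get_pred_sketch(
    text: str,
    classify_once: Callable[[str], str],
    num_evaluations: int = 5,
) -> Tuple[str, float]:
    """Sample the classifier repeatedly and score agreement.

    Assumption: confidence is the fraction of samples that match the
    most frequent label. The real mixin may use a different measure.
    """
    samples = [classify_once(text) for _ in range(num_evaluations)]
    category, count = Counter(samples).most_common(1)[0]
    return category, count / num_evaluations


# Deterministic stand-in for a stochastic LLM call: four of the five
# samples agree on 'Monologue'.
canned = iter(["Monologue", "Monologue", "Procedure", "Monologue", "Monologue"])
category, confidence = get_pred_sketch(
    "Thank you, operator.", lambda _text: next(canned)
)
```

Agreement across repeated samples is only informative when the underlying model is sampled with some randomness; a fully deterministic model would always score 1.0 under this measure.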
- model: str = 'llama3'
- output_path: str = 'output'
multimodal_fin.processing.preprocessing.preprocessor module
- class multimodal_fin.processing.preprocessing.preprocessor.Preprocessor(qa_model_names, monologue_model_names, num_evaluations=5, verbose=1, section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')[source]
Bases: object
Handles the full transcript preprocessing pipeline for a financial conference:
- Steps:
1. Section segmentation between ‘prepared_remarks’ and ‘q_a’.
2. Classification using ensemble of Q&A and monologue classifiers.
3. Annotation of question-answer pairs.
- divide_conference(csv_path, json_path)[source]
Assigns sections (‘prepared_remarks’ or ‘q_a’) to each row based on intro location.
- Parameters:
csv_path (str) – Path to transcript CSV.
json_path (str) – Path to LEVEL_4.json with Q&A intro.
- Return type:
DataFrame
- Returns:
DataFrame with new section column.
- extract_qna_intro(json_path)[source]
Extracts the first sentence of the Q&A section from the provided JSON.
- Parameters:
json_path (str) – Path to the LEVEL_4.json file.
- Return type:
str | None
- Returns:
First sentence of Q&A section or None if not found.
- monologue_model_names: List[str]
- num_evaluations: int = 5
- process(csv_path, json_path)[source]
Executes sectioning, classification, and annotation pipeline.
- Parameters:
csv_path (str) – Path to transcript CSV.
json_path (str) – Path to LEVEL_4.json.
- Return type:
DataFrame
- Returns:
Annotated and classified DataFrame.
- process_and_save(csv_path, json_path, output_csv_path)[source]
Runs the preprocessing pipeline and saves the final DataFrame to CSV.
- Parameters:
csv_path (str) – Input transcript CSV.
json_path (str) – LEVEL_4.json.
output_csv_path (str) – Path to save the processed CSV.
- Return type:
DataFrame
- Returns:
Final processed DataFrame.
- qa_model_names: List[str]
- qna_key: str = 'questions_and_answers'
- section_col: str = 'Conf_Section'
- text_col: str = 'text'
- verbose: int = 1
multimodal_fin.processing.preprocessing.qa_classifier module
- class multimodal_fin.processing.preprocessing.qa_classifier.CategoryQA(**data)[source]
Bases: BaseModel
Pydantic schema for Q&A classification output.
- category: Literal['Question', 'Answer', 'Procedure']
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class multimodal_fin.processing.preprocessing.qa_classifier.QAClassifier(model='llama3', NUM_EVALUATIONS=5)[source]
Bases: UncertaintyMixin
Classifier for identifying Q&A intervention types using an LLM with uncertainty estimation.
- NUM_EVALUATIONS: int = 5
Number of times to sample the model for uncertainty estimation.
- classify_dataframe(df)[source]
Classifies all interventions in a DataFrame.
- Parameters:
df (DataFrame) – DataFrame containing a ‘text’ column.
- Return type:
DataFrame
- Returns:
DataFrame with an added ‘classification’ column.
- classify_text(text)[source]
Classifies a single intervention as ‘Question’, ‘Answer’ or ‘Procedure’.
- Parameters:
text (str) – The input intervention text.
- Return type:
str
- Returns:
A string category predicted by the LLM.
- get_pred(text)[source]
Performs multiple classifications and computes uncertainty score.
- Parameters:
text (str) – Input intervention text.
- Return type:
Tuple[str, float]
- Returns:
A tuple of (predicted_category, confidence_score).
- model: str = 'llama3'
LLM model name.
multimodal_fin.processing.preprocessing.transcript_preprocessor module
- class multimodal_fin.processing.preprocessing.transcript_preprocessor.TranscriptPreprocessor(section_col='Conf_Section', text_col='text', qna_key='questions_and_answers')[source]
Bases: object
Preprocesses a conference transcript by identifying the beginning of the Q&A section and labeling each row as either ‘prepared_remarks’ or ‘q_a’.
- extract_qna_intro(json_path)[source]
Extracts the first sentence of the Q&A section from the metadata JSON.
- Parameters:
json_path (str) – Path to the JSON metadata file.
- Return type:
Optional[str]
- Returns:
The first sentence of the Q&A intro, or None if not found or file is invalid.
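The layout of the metadata JSON is not documented here. The sketch below assumes the value stored under qna_key is a plain string holding the Q&A text, and it takes the JSON content as a string instead of a file path so that it stays self-contained. The function name extract_first_sentence and the sentence-splitting rule are assumptions.

```python
import json
import re
from typing import Optional


def extract_first_sentence(
    json_text: str, qna_key: str = "questions_and_answers"
) -> Optional[str]:
    """Return the first sentence stored under qna_key, or None.

    Assumption: the value under qna_key is a string. None is returned
    for invalid JSON, a missing key, or an empty or non-string value.
    """
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    section = data.get(qna_key)
    if not isinstance(section, str) or not section.strip():
        return None

    # Split once on whitespace that follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?])\s+", section.strip(), maxsplit=1)[0]


metadata = json.dumps({
    "questions_and_answers": (
        "We will now begin the question-and-answer session. "
        "The first question is from Jane Doe."
    )
})
intro = extract_first_sentence(metadata)
missing = extract_first_sentence(json.dumps({"other_key": "text"}))
invalid = extract_first_sentence("{not valid json")
```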
- preprocess(csv_path, json_path)[source]
Labels each row in the transcript CSV as either ‘prepared_remarks’ or ‘q_a’ based on the location of the Q&A intro.
- Parameters:
csv_path (str) – Path to the transcript CSV file.
json_path (str) – Path to the metadata JSON file.
- Return type:
DataFrame
- Returns:
A DataFrame with an added column (section_col) containing the section labels.
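The labeling step can be sketched as follows, assuming the Q&A section starts at the first row whose text contains the intro sentence and that every row is labelled ‘prepared_remarks’ when no intro is available or matched. The function name label_sections, the substring match, and the fallback behavior are assumptions. The sketch uses a list of row texts in place of the CSV.

```python
from typing import List, Optional


def label_sections(texts: List[str], qna_intro: Optional[str]) -> List[str]:
    """Label rows as 'prepared_remarks' until the Q&A intro, then 'q_a'.

    Assumptions: the row containing the intro is itself part of 'q_a',
    and a missing or unmatched intro leaves every row as
    'prepared_remarks'.
    """
    start: Optional[int] = None
    if qna_intro:
        for i, text in enumerate(texts):
            if qna_intro in text:
                start = i
                break

    if start is None:
        return ["prepared_remarks"] * len(texts)
    return ["prepared_remarks"] * start + ["q_a"] * (len(texts) - start)


texts = [
    "Good morning and welcome to the call.",
    "Revenue grew this quarter.",
    "We will now begin the question-and-answer session. Please go ahead.",
    "What drove the margin change?",
]
sections = label_sections(
    texts, "We will now begin the question-and-answer session."
)
no_intro = label_sections(texts, None)
```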
- qna_key: str = 'questions_and_answers'
Key used to extract the Q&A intro from the JSON metadata.
- section_col: str = 'Conf_Section'
Name of the column to write the section labels.
- text_col: str = 'text'
Column containing the transcript text.