NLP-CISUC/FlowDisco

FlowDisco

This repository contains the FlowDisco pipeline for the automatic discovery, analysis, and visualization of dialogue flows from conversational histories. The approach follows a three-step methodology. First, utterance representation embeds each utterance into a vector space using models such as Sentence Transformers. Second, state discovery assigns each utterance to a discrete dialogue state, using unsupervised clustering by semantic similarity, explicit categorization via predefined keywords, or LLM-based zero-shot labeling. Third, the identified states become the vertices of a directed transition graph whose edges capture the order in which states follow one another in the conversations.
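The third step can be illustrated with a minimal sketch. It assumes the utterances of each dialogue have already been mapped to state labels, and it estimates transition probabilities by simple counting. The function name, the `START`/`END` markers, and the toy labels are hypothetical and are not taken from the notebook.

```python
from collections import Counter, defaultdict


def transition_probabilities(dialogues):
    """Estimate P(next_state | state) from per-dialogue state sequences."""
    counts = defaultdict(Counter)
    for states in dialogues:
        # Hypothetical START/END markers make openings and closings visible.
        path = ["START", *states, "END"]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Toy input: each inner list is one dialogue, already mapped to state labels.
dialogues = [
    ["greet", "ask_order", "confirm"],
    ["greet", "ask_order", "cancel"],
]
graph = transition_probabilities(dialogues)
print(graph["ask_order"])  # {'confirm': 0.5, 'cancel': 0.5}
```

Each key of `graph` is a vertex, and each inner entry is a directed edge weighted by its estimated probability.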

Beyond this core methodology, FlowDisco provides quantitative evaluation metrics, such as Transition Coverage and Flow F1-Score, which measure the alignment between the transitions in test data and the discovered reference flows. Sentiment enrichment adds visual cues for the emotional trajectory of a dialogue, summarized by metrics such as Flow Sentiment Cohesion. The pipeline also supports comparative analysis by quantifying interaction aspects such as graph density and action-to-dialogue ratios, which allows comparisons across distinct domains, from customer support and political debates to the structural analysis of poetry.
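The exact metric definitions are given in the papers cited below, not in this README. The sketch below is therefore only one plausible reading. It assumes that coverage is the share of distinct test transitions that also appear in the reference flow, and that the F1-style score is the harmonic mean of that share and the share of reference transitions seen in the test data. Graph density uses the standard formula for a directed graph without self-loops. All function names are hypothetical.

```python
def edge_set(dialogues):
    """Collect the distinct (source, target) state pairs seen in the dialogues."""
    return {(a, b) for states in dialogues for a, b in zip(states, states[1:])}


def flow_scores(reference_edges, test_edges):
    """Assumed definitions, not the authors': coverage and an F1-style score."""
    shared = reference_edges & test_edges
    # Share of test transitions that the reference flow also contains.
    coverage = len(shared) / len(test_edges) if test_edges else 0.0
    # Share of reference transitions that show up in the test data.
    reference_share = len(shared) / len(reference_edges) if reference_edges else 0.0
    total = coverage + reference_share
    f1 = 2 * coverage * reference_share / total if total else 0.0
    return coverage, f1


def directed_density(edges):
    """Density of a directed graph without self-loops: |E| / (|V| * (|V| - 1))."""
    proper = {(a, b) for a, b in edges if a != b}
    nodes = {node for edge in proper for node in edge}
    if len(nodes) < 2:
        return 0.0
    return len(proper) / (len(nodes) * (len(nodes) - 1))


reference = edge_set([["greet", "ask", "confirm"], ["greet", "ask", "cancel"]])
test = edge_set([["greet", "ask", "confirm"], ["greet", "complain"]])
coverage, f1 = flow_scores(reference, test)
print(round(coverage, 3), round(f1, 3))  # 0.667 0.667
print(directed_density(reference))  # 0.25
```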

Finally, the pipeline is accompanied by an interactive web application, in which users can explore the generated graphs, adjust transition probability thresholds, filter by speaker, and inspect the raw utterances that make up each dialogue state.

Datasets

The datasets are located in the data/ folder and should have the following column format:

  • turn_id: Unique identifier for each turn in the dialogue.
  • dialogue_id: Identifier of the dialogue to which the utterance belongs.
  • speaker: Identifier of the speaker (e.g., Speaker 1, Speaker 2).
  • utterance: The sentence spoken by the speaker in that turn.
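A quick check of this format can be sketched as follows. The sketch assumes the dataset is a comma-separated file with a header row, which this README does not state explicitly. The `load_turns` helper and the sample rows are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"turn_id", "dialogue_id", "speaker", "utterance"}


def load_turns(handle):
    """Read dialogue turns from CSV and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# In-memory stand-in for a file in the data/ folder.
sample = io.StringIO(
    "turn_id,dialogue_id,speaker,utterance\n"
    "1,d1,Speaker 1,Hi there\n"
    "2,d1,Speaker 2,How can I help?\n"
)
rows = load_turns(sample)
print(len(rows), rows[1]["speaker"])  # 2 Speaker 2
```

With a real file, the same function can be called as `load_turns(open(path, newline="", encoding="utf-8"))`.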

Datasets with Sentiment

To include sentiment in the flows, the dataset must also have the following column:

  • Median_Binary: Binary sentiment calculated for each utterance (1 for non-negative, 0 for negative).
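As an illustration of how this column might be consumed, the sketch below validates the values and computes the share of non-negative utterances per dialogue. This aggregation is an assumption chosen for the example. It is not the notebook's colouring logic, and the helper name is hypothetical.

```python
from collections import defaultdict


def non_negative_share(rows):
    """Return, per dialogue_id, the fraction of utterances with Median_Binary == 1."""
    totals = defaultdict(lambda: [0, 0])  # dialogue_id -> [non_negative, all]
    for row in rows:
        value = int(row["Median_Binary"])
        if value not in (0, 1):
            raise ValueError(f"Median_Binary must be 0 or 1, got {value}")
        totals[row["dialogue_id"]][0] += value
        totals[row["dialogue_id"]][1] += 1
    return {dialogue: pos / n for dialogue, (pos, n) in totals.items()}


# Toy rows; values are strings, as they would be when read from a text file.
rows = [
    {"dialogue_id": "d1", "Median_Binary": "1"},
    {"dialogue_id": "d1", "Median_Binary": "0"},
    {"dialogue_id": "d2", "Median_Binary": "1"},
]
print(non_negative_share(rows))  # {'d1': 0.5, 'd2': 1.0}
```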

Analysis Notebook

The FlowDisco.ipynb file is a Colab notebook that contains the code necessary to perform the following tasks:

  1. Utterance Representation: Conversion of utterances into vectors using embedding techniques.
  2. Clustering: Grouping of utterances into clusters based on their vector representations.
  3. Labelling: Labelling of clusters to identify dialogue patterns.
  4. Sentiment Analysis: If the dataset includes the Median_Binary column, sentiment is incorporated into the dialogue flow analysis.
  5. Flow Discovery: Identification of dialogue flow patterns between different dialogue states.

How to Run

  1. In the "Sentence Transformers Models" section:

    • Define the language of the stopwords to be used (English or Portuguese).
    • Choose the sentence transformer model (variable MODEL_ML) to be used to convert utterances into vectors.
  2. In the "Parameters" section:

    • Set the training dataset (variable filename) and the test dataset (variable filename_test).
    • Choose the clustering algorithm to be used (variable algorithm). Possible values: 'kmeans' and 'dbscan'.
    • Define the labelling method to be used (variable labelling). Possible values: 'verbs', 'keybert', 'closest' and 'llm'.
    • Specify the metric to optimize (variable metric_to_optimize). Possible values: 'silhouette' and 'vmeasure'.
    • Choose the threshold value for flow simplification (variable threshold). Range: 0 (min) to 0.20 (max).
    • Set the number of trials for Optuna (variable n_trials). Range: 1 (min) to 100 (max).
    • Define whether to show the count of utterances per cluster in the labels (variable count_utterances_label). Possible values: True (shows the number of utterances per cluster in labels) or False (no count shown).
    • Set whether system transitions should include sentiment (variable sentiment_system). Possible values: True (system transitions have sentiment-based color) or False (system transitions appear in black).
    • Specify whether to include sentiment in the flow (variable sentiment_in_flow). Possible values: True (sentiment is reflected in flow colors) or False (sentiment is not included).
    • Specify the size of the context window, in utterances (variable id_max). Range: 1 (min) to N (max), where N depends on the dataset size. If id_max is 1, only the current utterance is considered; if 2, the current utterance and the previous one are used, and so on.
  3. After adjusting the parameters above, run the remaining cells.
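Two of the parameters above, id_max and threshold, can be illustrated with a short sketch. It assumes that context is built by concatenating utterance texts and that threshold is applied to transition probabilities. The notebook may implement either differently, for example by combining embeddings. Both helper functions are hypothetical.

```python
def with_context(utterances, id_max):
    """Join each utterance with up to id_max - 1 preceding ones from the same dialogue."""
    if id_max < 1:
        raise ValueError("id_max must be at least 1")
    return [
        " ".join(utterances[max(0, i - id_max + 1): i + 1])
        for i in range(len(utterances))
    ]


def prune(graph, threshold):
    """Drop transitions whose probability is below the threshold."""
    return {
        src: {dst: p for dst, p in dsts.items() if p >= threshold}
        for src, dsts in graph.items()
    }


turns = ["Hi", "How can I help?", "I lost my card"]
print(with_context(turns, 2))
# ['Hi', 'Hi How can I help?', 'How can I help? I lost my card']

graph = {"greet": {"ask": 0.9, "bye": 0.1}}
print(prune(graph, 0.2))  # {'greet': {'ask': 0.9}}
```

With id_max set to 1 the utterances are returned unchanged, which matches the description of that setting.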

  • Libraries used in the notebook:
    • numpy
    • pandas
    • sklearn
    • matplotlib
    • seaborn
    • Others can be found in the first cell of the notebook.

How to cite

This project has been presented in four scientific papers on dialogue flow discovery. The BibTeX reference for each paper is given below.

Paper 1: Unsupervised Flow Discovery from Task-oriented Dialogues

A paper proposing an innovative approach to the unsupervised discovery of dialogue flows from conversation history, together with an automatic validation metric, was presented at the 23rd International Conference on Hybrid Intelligent Systems (HIS 2023). Below is the corresponding BibTeX:

@inproceedings{ferreira2023unsupervised,
    title = {Unsupervised Flow Discovery from Task-oriented Dialogues},
    author = {Patrícia Ferreira and Daniel Martins and Ana Alves and Catarina Silva and Hugo {Gonçalo~Oliveira}},
    booktitle = {Proceedings of the 23rd International Conference on Hybrid Intelligent Systems (HIS 2023)},
    year = {2023},
    publisher = {Springer}
}

Paper 2: Sentiment-Aware Dialogue Flow Discovery for Interpreting Communication Trends

This paper presents a generic approach to dialogue flow discovery, using clustering techniques to identify dialogue states and state transitions, as well as analyzing the prevailing sentiment. The approach aims to enhance interpretability and provide support to artificial agents in customer support scenarios. Below is the corresponding BibTeX:

@inproceedings{ferreira-etal-2024-sentiment,
    title = "Sentiment-Aware Dialogue Flow Discovery for Interpreting Communication Trends",
    author = "Ferreira, Patr{\'\i}cia Sofia Pereira  and
      Carvalho, Isabel  and
      Alves, Ana  and
      Silva, Catarina  and
      Oliveira, Hugo Gon{\c{c}}alo",
    editor = "Kawahara, Tatsuya  and
      Demberg, Vera  and
      Ultes, Stefan  and
      Inoue, Koji  and
      Mehri, Shikib  and
      Howcroft, David  and
      Komatani, Kazunori",
    booktitle = "Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2024",
    address = "Kyoto, Japan",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigdial-1.24",
    doi = "10.18653/v1/2024.sigdial-1.24",
    pages = "274--288",
    abstract = "Customer-support services increasingly rely on automation, whether fully or with human intervention. Despite optimising resources, this may result in mechanical protocols and lack of human interaction, thus reducing customer loyalty. Our goal is to enhance interpretability and provide guidance in communication through novel tools for easier analysis of message trends and sentiment variations. Monitoring these contributes to more informed decision-making, enabling proactive mitigation of potential issues, such as protocol deviations or customer dissatisfaction. We propose a generic approach for dialogue flow discovery that leverages clustering techniques to identify dialogue states, represented by related utterances. State transitions are further analyzed to detect prevailing sentiments. Hence, we discover sentiment-aware dialogue flows that offer an interpretability layer to artificial agents, even those based on black-boxes, ultimately increasing trustworthiness. Experimental results demonstrate the effectiveness of our approach across different dialogue datasets, covering both human-human and human-machine exchanges, applicable in task-oriented contexts but also to social media, highlighting its potential impact across various customer-support settings.",
}

Paper 3: UnHIDE: A Novel Framework for Unsupervised Human-Interpretable Dialogue Exploration

We introduce UnHIDE, a novel, unsupervised framework for Human-Interpretable Dialogue Exploration. UnHIDE is designed to support human understanding of large collections of dialogues by surfacing interpretable structures and trends. It operates in three stages: (1) utterance clustering to group semantically similar dialogue turns, (2) flow discovery to build dialogue trajectories based on these clusters, and (3) the computation of interpretable metrics to analyze flow complexity, sentiment progression, and response times. We evaluate UnHIDE using a newly-created, automatically-generated, task-oriented dialogue dataset, where dialogue length, sentiment dynamics, and timing are systematically varied. This paper was accepted for publication in IEEE Access.

@article{ferreira2025unhide,
  title={UnHIDE: A Novel Framework for Unsupervised Human-Interpretable Dialogue Exploration},
  author={Ferreira, Patr{\'\i}cia and Alves, Ana and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
  journal={IEEE Access},
  volume={13},
  pages={200001--200014},
  year={2025},
  publisher={IEEE}
}

Paper 4: Analyzing Debate Dynamics in the Portuguese Parliament with Dialogue Action Flows

Analyzing how large-scale multi-party dialogues shape collective behavior is a central challenge in computational linguistics. However, traditional text-based methods often overlook the complex, non-linear turn-taking dynamics defining these interactions. To address this gap, we propose a framework based on Dialogue Action Flows (DAFs) that integrates verbal utterances and non-verbal actions into a unified probabilistic representation of interactional behavior. Interactions are encoded as speaker-action states, forming a probabilistic DAF that reveals dominant behavioral trajectories and recurrent patterns. We validate this framework on five years of Portuguese Parliament debates. Analysis reveals systematic behavioral asymmetries driven by party roles: while government parties exhibit increasing alignment, opposition forces, particularly the radical wing, maintain persistently high conflict. Additionally, the rising volume of interactions across legislative years indicates a progressively heated environment. Overall, our framework provides a quantitative and interpretable approach for modeling polarization, alignment, and interactional dynamics in multi-party political discourse.

@inproceedings{ferreira2026analyzing,
  title={Analyzing Debate Dynamics in the Portuguese Parliament with Dialogue Action Flows},
  author={Ferreira, Patr{\'\i}cia and Alves, Ana and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
  booktitle={Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026)-Vol. 1},
  pages={369--379},
  year={2026}
}
