This repository contains the FlowDisco pipeline for the automatic discovery, analysis, and visualization of dialogue flows from conversational histories. The approach follows a flexible three-step methodology. It begins with the representation of utterances, which are embedded into a vector space using models such as Sentence Transformers. Subsequently, state discovery assigns these utterances to discrete dialogue states dynamically. This can be achieved utilizing unsupervised clustering by semantic similarity, explicit categorization via predefined keywords, or LLM-based zero-shot labeling. Finally, these identified states serve as vertices in a directed transition graph, effectively mapping the chronological dynamics and conversational patterns of the interactions.
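The third step can be pictured with a minimal sketch. It assumes the first two steps have already assigned one state label to every utterance, and it normalises transition counts per source state to obtain probabilities. The function name `build_transition_graph`, the state labels, and the normalisation choice are illustrative assumptions, not the notebook's actual code.

```python
from collections import Counter, defaultdict


def build_transition_graph(dialogues):
    """Count state-to-state transitions and normalise them per source state.

    `dialogues` maps a dialogue id to its chronologically ordered list of
    state labels (assumed to come from the state-discovery step).
    """
    counts = defaultdict(Counter)
    for states in dialogues.values():
        # Each consecutive pair of states is one directed edge.
        for source, target in zip(states, states[1:]):
            counts[source][target] += 1

    graph = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        graph[source] = {target: n / total for target, n in targets.items()}
    return graph


# Hypothetical state sequences for two dialogues.
dialogues = {
    "d1": ["greet", "ask_issue", "give_solution", "goodbye"],
    "d2": ["greet", "ask_issue", "ask_issue", "goodbye"],
}
graph = build_transition_graph(dialogues)
```

In this toy input, `greet` is always followed by `ask_issue`, while `ask_issue` splits evenly across three successors. A state that only ever ends a dialogue, such as `goodbye`, has no outgoing edges and does not appear as a key.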
Beyond this core methodology, FlowDisco introduces quantitative evaluation metrics, such as Transition Coverage and Flow F1-Score, to measure the alignment between test data transitions and discovered reference flows. Sentiment enrichment adds visual cues for the emotional trajectory of the dialogue, summarized by metrics like Flow Sentiment Cohesion. The pipeline also supports comparative analytics by quantifying key interaction aspects, such as graph density and action-to-dialogue ratios. This allows for objective comparisons across highly distinct domains, from customer support and political debates to the structural analysis of poetry.
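This README does not give formulas for these metrics, so the sketch below is only one plausible reading. It assumes Transition Coverage is the share of test transitions that also occur in the reference flow, and it uses the standard density of a directed graph without self-loops, |E| / (|V| (|V| - 1)). All function names and state labels are hypothetical.

```python
def transitions(dialogues):
    """Return the set of (source, target) state pairs seen in the dialogues."""
    pairs = set()
    for states in dialogues:
        pairs.update(zip(states, states[1:]))
    return pairs


def transition_coverage(reference_edges, test_dialogues):
    """Share of test transitions (repeats included) found in the reference flow.

    Assumed definition, for illustration only.
    """
    observed = [pair for states in test_dialogues for pair in zip(states, states[1:])]
    if not observed:
        return 0.0
    return sum(pair in reference_edges for pair in observed) / len(observed)


def graph_density(edges):
    """Density of a directed graph, ignoring self-loops."""
    nodes = {node for edge in edges for node in edge}
    if len(nodes) < 2:
        return 0.0
    proper_edges = {(a, b) for a, b in edges if a != b}
    return len(proper_edges) / (len(nodes) * (len(nodes) - 1))


# Hypothetical reference flow and test dialogue.
reference = transitions([["greet", "ask", "solve", "bye"]])
coverage = transition_coverage(reference, [["greet", "ask", "bye"]])
density = graph_density(reference)
```

Here one of the two test transitions (`greet` to `ask`) exists in the reference flow, and the reference graph has 3 of the 12 possible directed edges among its 4 states.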
Finally, the pipeline is accompanied by an interactive web application. In this interface, users can explore the generated graphs, adjust transition probability thresholds, filter by specific speakers, and inspect the raw utterances that make up each dialogue state.
The datasets are located in the data/ folder and should have the following column format:
- turn_id: Unique identifier for each turn in the dialogue.
- dialogue_id: Identifier of the dialogue to which the utterance belongs.
- speaker: Identifier of the speaker (e.g., Speaker 1, Speaker 2).
- utterance: The sentence spoken by the speaker in that turn.
To include sentiment in the flows, the dataset must also have the following column:
- Median_Binary: Binary sentiment calculated for each utterance (1 for non-negative, 0 for negative).
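A quick header check before running the notebook can catch a missing column early. The sketch below assumes the dataset is a comma-separated file, which this README does not state. The helper `check_columns` and the sample rows are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"turn_id", "dialogue_id", "speaker", "utterance"}
SENTIMENT_COLUMN = "Median_Binary"


def check_columns(csv_file):
    """Inspect the header row of an open CSV file.

    Returns a pair: the sorted list of missing required columns, and
    whether the optional sentiment column is present.
    """
    header = next(csv.reader(csv_file), [])
    columns = {name.strip() for name in header}
    missing = sorted(REQUIRED_COLUMNS - columns)
    return missing, SENTIMENT_COLUMN in columns


# In-memory sample standing in for a file in the data/ folder.
sample = io.StringIO(
    "turn_id,dialogue_id,speaker,utterance,Median_Binary\n"
    "1,d1,Speaker 1,Hello! How can I help?,1\n"
    "2,d1,Speaker 2,My order never arrived.,0\n"
)
missing, has_sentiment = check_columns(sample)
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`.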
The FlowDisco.ipynb file is a Colab notebook that contains the code necessary to perform the following tasks:
- Utterance Representation: Conversion of utterances into vectors using embedding techniques.
- Clustering: Grouping of utterances into clusters based on their vector representations.
- Labelling: Labelling of clusters to identify dialogue patterns.
- Sentiment Analysis: If the dataset includes the Median_Binary column, sentiment is incorporated into the dialogue flow analysis.
- Flow Discovery: Identification of dialogue flow patterns between different dialogue states.
- In the "Sentence Transformers Models" section:
  - Define the language of the stopwords to be used (English or Portuguese).
  - Choose the sentence transformer model (variable MODEL_ML) to be used to convert utterances into vectors.
- In the "Parameters" section:
  - Set the training dataset (variable filename) and the test dataset (variable filename_test).
  - Choose the clustering algorithm to be used (variable algorithm). Possible values: 'kmeans' and 'dbscan'.
  - Define the labelling method to be used (variable labelling). Possible values: 'verbs', 'keybert', 'closest' and 'llm'.
  - Specify the metric to optimize (variable metric_to_optimize). Possible values: 'silhouette' and 'vmeasure'.
  - Choose the threshold value for flow simplification (variable threshold). Range: 0 (min) to 0.20 (max).
  - Set the number of trials for Optuna (variable n_trials). Range: 1 (min) to 100 (max).
  - Define whether to show the count of utterances per cluster in the labels (variable count_utterances_label). Possible values: True (shows the number of utterances per cluster in labels) or False (no count shown).
  - Set whether system transitions should include sentiment (variable sentiment_system). Possible values: True (system transitions have sentiment-based color) or False (system transitions appear in black).
  - Specify whether to include sentiment in the flow (variable sentiment_in_flow). Possible values: True (sentiment is reflected in flow colors) or False (sentiment is not included).
  - Specify the number of previous utterances to consider for context (variable id_max). Range: 1 (min) to N (max), where N depends on the dataset size. If id_max is 1, only the current utterance is considered; if 2, both the current and the previous one are used, and so on.
- After adjusting the parameters above, run the remaining cells.
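As a rough illustration, a filled-in "Parameters" cell might look like the sketch below. The variable names are the ones listed above; every value, including both file paths, is a placeholder. The two helpers show one plausible interpretation of id_max (a sliding context window within a single dialogue) and of threshold (dropping low-probability transitions). Both helpers are assumptions for illustration, not the notebook's implementation.

```python
# Variable names follow the notebook; all values are placeholders.
filename = "data/train.csv"       # hypothetical path
filename_test = "data/test.csv"   # hypothetical path
algorithm = "kmeans"              # 'kmeans' or 'dbscan'
labelling = "keybert"             # 'verbs', 'keybert', 'closest' or 'llm'
metric_to_optimize = "silhouette" # 'silhouette' or 'vmeasure'
threshold = 0.10                  # 0 to 0.20
n_trials = 20                     # 1 to 100
count_utterances_label = True
sentiment_system = False
sentiment_in_flow = True
id_max = 2                        # current utterance plus one previous


def with_context(utterances, id_max):
    """Join each utterance with up to id_max - 1 preceding ones.

    Assumed behaviour; `utterances` holds the turns of one dialogue.
    """
    windows = []
    for i in range(len(utterances)):
        start = max(0, i - id_max + 1)
        windows.append(" ".join(utterances[start:i + 1]))
    return windows


def simplify(graph, threshold):
    """Drop edges whose transition probability is below `threshold`.

    Assumed behaviour of flow simplification.
    """
    return {
        source: {t: p for t, p in targets.items() if p >= threshold}
        for source, targets in graph.items()
    }


turns = ["Hi", "Hello, how can I help?", "My order is late"]
contexts = with_context(turns, id_max)
pruned = simplify({"greet": {"ask_issue": 0.95, "goodbye": 0.05}}, threshold)
```

With id_max set to 1, `with_context` returns the utterances unchanged, matching the description of that setting above.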
- Libraries used in the notebook:
  - numpy
  - pandas
  - sklearn
  - matplotlib
  - seaborn
  - Others can be found in the first cell of the notebook.
This project has been presented in four scientific papers, each proposing an approach to dialogue flow discovery. Below are the BibTeX references for these papers.
A paper proposing an approach to the unsupervised discovery of dialogue flows from conversation history, together with an automatic validation metric, was presented at the 23rd International Conference on Hybrid Intelligent Systems (HIS 2023). See the BibTeX:
@inproceedings{ferreira2023unsupervised,
title = {Unsupervised Flow Discovery from Task-oriented Dialogues},
author = {Patrícia Ferreira and Daniel Martins and Ana Alves and Catarina Silva and Hugo {Gonçalo~Oliveira}},
booktitle = {Proceedings of the 23rd International Conference on Hybrid Intelligent Systems (HIS 2023)},
year = {2023},
publisher = {Springer}
}
This paper presents a generic approach to dialogue flow discovery, using clustering techniques to identify dialogue states and state transitions, as well as analyzing the prevailing sentiment. The approach aims to enhance interpretability and provide support to artificial agents in customer support scenarios. Below is the corresponding BibTeX:
@inproceedings{ferreira-etal-2024-sentiment,
title = "Sentiment-Aware Dialogue Flow Discovery for Interpreting Communication Trends",
author = "Ferreira, Patr{\'\i}cia Sofia Pereira and
Carvalho, Isabel and
Alves, Ana and
Silva, Catarina and
Oliveira, Hugo Gon{\c{c}}alo",
editor = "Kawahara, Tatsuya and
Demberg, Vera and
Ultes, Stefan and
Inoue, Koji and
Mehri, Shikib and
Howcroft, David and
Komatani, Kazunori",
booktitle = "Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
month = sep,
year = "2024",
address = "Kyoto, Japan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigdial-1.24",
doi = "10.18653/v1/2024.sigdial-1.24",
pages = "274--288",
abstract = "Customer-support services increasingly rely on automation, whether fully or with human intervention. Despite optimising resources, this may result in mechanical protocols and lack of human interaction, thus reducing customer loyalty. Our goal is to enhance interpretability and provide guidance in communication through novel tools for easier analysis of message trends and sentiment variations. Monitoring these contributes to more informed decision-making, enabling proactive mitigation of potential issues, such as protocol deviations or customer dissatisfaction. We propose a generic approach for dialogue flow discovery that leverages clustering techniques to identify dialogue states, represented by related utterances. State transitions are further analyzed to detect prevailing sentiments. Hence, we discover sentiment-aware dialogue flows that offer an interpretability layer to artificial agents, even those based on black-boxes, ultimately increasing trustworthiness. Experimental results demonstrate the effectiveness of our approach across different dialogue datasets, covering both human-human and human-machine exchanges, applicable in task-oriented contexts but also to social media, highlighting its potential impact across various customer-support settings.",
}
We introduce UnHIDE, a novel, unsupervised framework for Human-Interpretable Dialogue Exploration. UnHIDE is designed to support human understanding of large collections of dialogues by surfacing interpretable structures and trends. It operates in three stages: (1) utterance clustering to group semantically similar dialogue turns, (2) flow discovery to build dialogue trajectories based on these clusters, and (3) the computation of interpretable metrics to analyze flow complexity, sentiment progression, and response times. We evaluate UnHIDE using a newly-created, automatically-generated, task-oriented dialogue dataset, where dialogue length, sentiment dynamics, and timing are systematically varied. This paper was accepted at the IEEE Access Journal.
@article{ferreira2025unhide,
title={UnHIDE: A Novel Framework for Unsupervised Human-Interpretable Dialogue Exploration},
author={Ferreira, Patr{\'\i}cia and Alves, Ana and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
journal={IEEE Access},
volume={13},
pages={200001--200014},
year={2025},
publisher={IEEE}
}
Analyzing how large-scale multi-party dialogues shape collective behavior is a central challenge in computational linguistics. However, traditional text-based methods often overlook the complex, non-linear turn-taking dynamics defining these interactions. To address this gap, we propose a framework based on Dialogue Action Flows (DAFs) that integrates verbal utterances and non-verbal actions into a unified probabilistic representation of interactional behavior. Interactions are encoded as speaker-action states, forming a probabilistic DAF that reveals dominant behavioral trajectories and recurrent patterns. We validate this framework on five years of Portuguese Parliament debates. Analysis reveals systematic behavioral asymmetries driven by party roles: while government parties exhibit increasing alignment, opposition forces, particularly the radical wing, maintain persistently high conflict. Additionally, the rising volume of interactions across legislative years indicates a progressively heated environment. Overall, our framework provides a quantitative and interpretable approach for modeling polarization, alignment, and interactional dynamics in multi-party political discourse.
@inproceedings{ferreira2026analyzing,
title={Analyzing Debate Dynamics in the Portuguese Parliament with Dialogue Action Flows},
author={Ferreira, Patr{\'\i}cia and Alves, Ana and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
booktitle={Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026)-Vol. 1},
pages={369--379},
year={2026}
}