Key Technologies / Libraries:
- Topic extraction / LDA (spaCy, gensim)
- Clustering (Scikit-learn)
- Web-hosted interactive dashboard (Plotly Dash)
Data points rarely exist in isolation. In many domains, the relationships between data points are a key signal of how important any one of them is.
In this project, we visualised the CORD-19 dataset of academic papers as an interactive network. Each paper is drawn as a circle whose size reflects its number of citations, and each citation is drawn as a line between two circles.
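The mapping from papers and citations to circles and lines can be sketched as follows. This is a minimal illustration rather than the project's actual code: the toy citation pairs, the `build_network` helper, and the `base_size` and `size_per_citation` parameters are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy data: (citing_paper, cited_paper) pairs.
citations = [
    ("paper_b", "paper_a"),
    ("paper_c", "paper_a"),
    ("paper_c", "paper_b"),
    ("paper_d", "paper_a"),
]


def build_network(citation_pairs, base_size=10, size_per_citation=5):
    """Return (nodes, edges) for a citation network.

    Each node records how often the paper is cited and a circle size
    that grows linearly with that count (an assumed scaling rule).
    """
    cited_counts = Counter(cited for _, cited in citation_pairs)
    papers = sorted({paper for pair in citation_pairs for paper in pair})
    nodes = {
        paper: {
            "citations": cited_counts[paper],
            "size": base_size + size_per_citation * cited_counts[paper],
        }
        for paper in papers
    }
    # One line per distinct citation; duplicates are collapsed.
    edges = sorted(set(citation_pairs))
    return nodes, edges


nodes, edges = build_network(citations)
```

The resulting `nodes` and `edges` could then be handed to whichever plotting layer draws the network; a linear size rule is only one option, and a logarithmic scale may read better when citation counts vary widely.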
The app lets the user filter the dataset dynamically to speed up their research. They can filter by individual journal, or show only the more "significant" papers that have received a higher number of citations.
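The two filters could look roughly like the sketch below. The `Paper` record, the `filter_papers` and `filter_edges` helpers, and the sample journals are hypothetical; in a Dash app, logic of this kind would typically sit inside a callback that responds to a dropdown or slider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    journal: str
    citations: int


def filter_papers(papers, journals=None, min_citations=0):
    """Keep papers from the chosen journals with enough citations.

    journals=None (or an empty selection) means "all journals".
    """
    selected = set(journals) if journals else None
    return [
        p for p in papers
        if (selected is None or p.journal in selected)
        and p.citations >= min_citations
    ]


def filter_edges(edges, kept_papers):
    """Keep only citation lines whose two endpoints are both still shown."""
    kept_ids = {p.paper_id for p in kept_papers}
    return [(s, t) for s, t in edges if s in kept_ids and t in kept_ids]


# Hypothetical sample data.
papers = [
    Paper("p1", "Journal A", 12),
    Paper("p2", "Journal B", 3),
    Paper("p3", "Journal A", 1),
    Paper("p4", "Journal B", 40),
]
edges = [("p2", "p1"), ("p3", "p1"), ("p1", "p4"), ("p3", "p4")]

significant = filter_papers(papers, min_citations=10)
visible_edges = filter_edges(edges, significant)
journal_a = filter_papers(papers, journals=["Journal A"])
```

Filtering the edges together with the nodes avoids drawing lines to circles that are no longer displayed.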
Each academic paper is processed to extract its key concepts and topics. Similar papers can then be grouped, so that "clusters" of related papers can be identified.
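One way to go from raw text to clusters is sketched below. It is an assumption-laden illustration, not the project's pipeline: for brevity it uses scikit-learn's `LatentDirichletAllocation` in place of the spaCy and gensim tooling listed above, it picks k-means as the clustering algorithm, and the four abstracts, the `cluster_by_topic` helper, and the topic and cluster counts are all made up.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for paper abstracts.
abstracts = [
    "vaccine trial immune response antibody vaccine dose",
    "antibody immune response vaccine efficacy trial",
    "transmission model epidemic spread reproduction number",
    "epidemic spread model lockdown transmission rate",
]


def cluster_by_topic(texts, n_topics=2, n_clusters=2, seed=0):
    """Fit LDA on word counts, then cluster papers by topic mixture."""
    # Bag-of-words counts, dropping common English stop words.
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    # Each row of topic_mix is one paper's distribution over topics.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topic_mix = lda.fit_transform(counts)
    # Papers with similar topic mixtures end up in the same cluster.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(topic_mix)
    return topic_mix, labels


topic_mix, labels = cluster_by_topic(abstracts)
```

On a corpus this small the topics are not meaningful; the point is only the shape of the pipeline, where each paper becomes a topic vector and each vector receives a cluster label.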
In turn, the network diagram lets the user quickly identify the key academic papers in each cluster, and shows which clusters are more densely or sparsely interrelated through the number of connections between them.
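The same two readings can also be computed directly from the graph. The sketch below assumes each paper already has a cluster label and that edges are (citing, cited) pairs; the helper names and the toy data are hypothetical.

```python
from collections import Counter


def cluster_link_counts(edges, cluster_of):
    """Count citation lines within and between clusters.

    Keys are sorted (cluster, cluster) pairs, so (0, 1) covers
    citations in either direction between clusters 0 and 1.
    """
    counts = Counter()
    for source, target in edges:
        pair = tuple(sorted((cluster_of[source], cluster_of[target])))
        counts[pair] += 1
    return counts


def key_paper_per_cluster(edges, cluster_of):
    """Pick the most-cited paper in each cluster (first one wins ties)."""
    cited = Counter(target for _, target in edges)
    best = {}
    for paper, cluster in cluster_of.items():
        if cluster not in best or cited[paper] > cited[best[cluster]]:
            best[cluster] = paper
    return best


# Hypothetical cluster labels and (citing, cited) edges.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
edges = [("b", "a"), ("c", "a"), ("d", "c"), ("d", "a"), ("e", "d"), ("e", "c")]

links = cluster_link_counts(edges, cluster_of)
key_papers = key_paper_per_cluster(edges, cluster_of)
```

Pairs with high counts in `links` correspond to clusters that appear densely connected in the diagram, while pairs that are absent correspond to clusters with no citations between them.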