HTML-TM - About
HTML-TM (HyperText Markup Language for Text Mining) is a tool designed to facilitate the exploration and analysis of large text datasets. It provides an interactive, user-friendly interface for navigating the words and documents of a corpus and the relationships between them. The sections below describe the tool's key components and features.
Key Components of HTML-TM
1. WORDS.html
The WORDS.html file serves as the primary entry point for exploring words within the dataset. It contains a structured table with the following columns:
- #: The index number of the word in the dataset, providing a unique identifier for each entry.
- Word: The specific word from the dataset being analyzed.
- Occ.: The word's occurrence count, expressed as the number of documents that contain the word.
- Related Words: A list of the top 10 words most closely related to the target word, based on cosine similarity.
- Related Doc.: A link to a separate page listing documents (papers) most closely associated with the word, based on cosine similarity. For more details about this page, see the "Related Documents Page" section below.
- Word Tree: A hierarchical visualization of the 30 most related terms for the word, providing a structured overview of semantic connections.
- Year Plot: A graphical representation of the word's usage over time, helping users identify trends and patterns.
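The Related Words column ranks terms by cosine similarity. The following is a minimal sketch of that kind of ranking, assuming each word is represented by a numeric vector. The `top_related` helper, the toy vectors, and the zero-vector convention are hypothetical illustrations and are not taken from the HTML-TM implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def top_related(target, vectors, k=10):
    """Return up to k (word, score) pairs most similar to `target`, best first."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target  # a word is not listed as related to itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical toy vectors, for illustration only.
vectors = {
    "gene": [1.0, 0.0, 1.0],
    "protein": [1.0, 0.1, 0.9],
    "genome": [0.9, 0.0, 1.0],
    "hospital": [0.0, 1.0, 0.0],
}
related = top_related("gene", vectors, k=2)
```

With `k=10` the same function would yield a list the size of the Related Words column, and with `k=30` a list the size of the Word Tree input.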
2. TEXTS.html
The TEXTS.html file is designed for exploring documents within the corpus. It includes a table with the following columns:
- #: The index number of the document in the dataset, serving as a unique identifier.
- Title: The title of the document (paper), providing a quick reference to its content.
- Year: The publication year of the document.
- Related Doc.: A link to a separate page listing documents most similar to the current one, based on cosine similarity. For more details about this page, see the "Related Documents Page" section below.
- PMID: The PubMed ID of the document, linked to its corresponding page on PubMed for easy access.
Related Documents Page
Both WORDS.html and TEXTS.html direct users to a Related Documents page when clicking on the "Related Doc." link. This page contains a table with the following columns:
- #: The index number of the related document.
- Rank: The ranking of the document based on its similarity score to the target word or document.
- Similarity: A numerical score indicating the degree of similarity between the target and the related document, calculated using cosine similarity.
- Title + Abs.: The title of the related document, accompanied by its full abstract.
- Year: The publication year of the document.
- PMID: The PubMed ID of the related document, linked to its corresponding page on PubMed for easy access.
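As a rough illustration of how the rows of such a table could be assembled, the sketch below sorts documents by a precomputed similarity score, assigns a 1-based rank, and builds a PubMed link from the PMID. The function name, the dictionary keys, and the sample records (including the placeholder PMIDs and titles) are assumptions made for this example and do not describe HTML-TM's actual code or data.

```python
def related_documents_rows(scored_docs):
    """Sort documents by similarity (highest first) and attach rank and PubMed URL."""
    ordered = sorted(scored_docs, key=lambda d: d["similarity"], reverse=True)
    rows = []
    for rank, doc in enumerate(ordered, start=1):
        rows.append({
            "index": doc["index"],           # "#" column: position in the dataset
            "rank": rank,                    # 1 = most similar
            "similarity": round(doc["similarity"], 4),
            "title": doc["title"],
            "year": doc["year"],
            "pmid": doc["pmid"],
            "url": f"https://pubmed.ncbi.nlm.nih.gov/{doc['pmid']}/",
        })
    return rows


# Placeholder records, for illustration only (not real papers or PMIDs).
scored_docs = [
    {"index": 7, "title": "Example paper A", "year": 2018,
     "pmid": "00000001", "similarity": 0.41237},
    {"index": 3, "title": "Example paper B", "year": 2021,
     "pmid": "00000002", "similarity": 0.87654},
]
rows = related_documents_rows(scored_docs)
```

Note that the index and the rank are independent: the index identifies the document in the dataset, while the rank depends only on the similarity to the selected word or document.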
Search Functionality
HTML-TM provides a table search tool that allows users to filter rows using complex queries, including logical operators (AND/OR), column-specific searches, and regular expressions. Detailed information can be accessed via the Help button located next to the search button on the WORDS.html and TEXTS.html pages.
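The exact query syntax is documented behind the Help button and is not reproduced here. The sketch below only illustrates the general idea of combining column-specific regular expressions with AND/OR logic. The `matches` and `filter_rows` helpers, the `(column, pattern)` condition format, the case-insensitive matching, and the sample rows are all hypothetical.

```python
import re


def matches(row, condition):
    """True if the regex matches the named column, or any column when column is None."""
    column, pattern = condition
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitivity is an assumption
    cells = [row[column]] if column else row.values()
    return any(regex.search(str(cell)) for cell in cells)


def filter_rows(rows, conditions, mode="AND"):
    """Keep rows satisfying all conditions (AND) or at least one condition (OR)."""
    combine = all if mode == "AND" else any
    return [row for row in rows if combine(matches(row, c) for c in conditions)]


# Hypothetical table rows, for illustration only.
rows = [
    {"Title": "Genome assembly methods", "Year": 2015},
    {"Title": "Protein folding review", "Year": 2021},
    {"Title": "Gene expression atlas", "Year": 1999},
]
conditions = [("Title", r"^gen"), ("Year", r"^20\d\d$")]

both = filter_rows(rows, conditions, mode="AND")    # title AND year must match
either = filter_rows(rows, conditions, mode="OR")   # title OR year may match
anywhere = filter_rows(rows, [(None, r"folding")])  # search across every column
```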
Authors and Institution
The HTML-TM tool is developed by:
- Roberto T. Raittz (1, 2)
- Diogo de J. S. Machado (1, 2)
Affiliations:
- (1) Laboratory of Artificial Intelligence Applied to Bioinformatics, Federal University of Paraná, Curitiba, PR, Brazil
- (2) Graduate Program in Bioinformatics, Federal University of Paraná, Curitiba, PR, Brazil