HTML-TM - About

HTML-TM (HyperText Markup Language for Text Mining) is a tool for exploring and analyzing large text datasets. It provides an interactive interface for navigating words, documents, and the relationships between them within a corpus. The key components and features of the tool are described below.

Key Components of HTML-TM

1. WORDS.html

The WORDS.html file serves as the primary entry point for exploring words within the dataset. It contains a structured table with the following columns:

#: The index number of the word in the dataset, providing a unique identifier for each entry.
Word: The specific word from the dataset being analyzed.
Occ.: The word's occurrence count, given as the number of documents in the corpus that contain the word.
Related Words: A list of the top 10 words most closely related to the target word, based on cosine similarity.
Related Doc.: A link to a separate page listing documents (papers) most closely associated with the word, based on cosine similarity. For more details about this page, see the "Related Documents Page" section below.
Word Tree: A hierarchical visualization of the 30 most related terms for the word, providing a structured overview of semantic connections.
Year Plot: A graphical representation of the word's usage over time, helping users identify trends and patterns.
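
The Related Words and Related Doc. columns both rank items by cosine similarity. The sketch below illustrates that ranking on toy data. It assumes each word is already represented as a numeric vector; how HTML-TM actually builds its vectors is not stated here, and the names cosine_similarity, top_related, and toy_vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)


def top_related(target, vectors, k=10):
    """Return the k entries most similar to `target`, best first."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "gene": [1.0, 0.0, 1.0],
    "protein": [1.0, 0.1, 0.9],
    "city": [0.0, 1.0, 0.0],
}

related = top_related("gene", toy_vectors, k=2)
for word, score in related:
    print(f"{word}: {score:.3f}")
```

With k=10 the same function would yield a list like the one in the Related Words column, and applying it to document vectors instead of word vectors would give a ranking like the one behind Related Doc.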

2. TEXTS.html

The TEXTS.html file is designed for exploring documents within the corpus. It includes a table with the following columns:

#: The index number of the document in the dataset, serving as a unique identifier.
Title: The title of the document (paper), providing a quick reference to its content.
Year: The publication year of the document.
Related Doc.: A link to a separate page listing documents most similar to the current one, based on cosine similarity. For more details about this page, see the "Related Documents Page" section below.
PMID: The PubMed ID of the document, linked to its corresponding page on PubMed for easy access.

Search Functionality

HTML-TM provides a table search tool that allows users to filter rows using complex queries, including logical operators (AND/OR), column-specific searches, and regular expressions. Detailed information can be accessed via the Help button located next to the search button on the WORDS.html and TEXTS.html pages.
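
As a rough illustration of how such a filter can combine logical operators, column-specific terms, and regular expressions, the sketch below applies a query to a small list of rows. It does not reproduce HTML-TM's query syntax, which is documented behind the Help button. Here the query is a plain Python structure (an OR over groups of AND-ed terms), and the names term_matches, filter_rows, and the sample rows are hypothetical.

```python
import re


def term_matches(row, column, pattern):
    """True if `pattern` (a regex) matches the given column of the row.

    A column of None means "search every column".
    """
    values = row.values() if column is None else [row.get(column, "")]
    return any(re.search(pattern, str(value), re.IGNORECASE) for value in values)


def filter_rows(rows, query):
    """Keep rows that satisfy at least one AND-group of the query.

    `query` is a list of groups (combined with OR); each group is a list
    of (column, pattern) terms (combined with AND).
    """
    return [
        row
        for row in rows
        if any(all(term_matches(row, col, pat) for col, pat in group) for group in query)
    ]


# Sample rows, invented for illustration only.
rows = [
    {"Title": "Gene expression in yeast", "Year": "2019"},
    {"Title": "Protein folding review", "Year": "2021"},
    {"Title": "Gene therapy advances", "Year": "2021"},
]

# (Title starts with "gene" AND Year is 2021) OR (any column mentions "fold" or "folding")
query = [
    [("Title", r"^gene"), ("Year", r"^2021$")],
    [(None, r"fold(ing)?")],
]

matched = filter_rows(rows, query)
for row in matched:
    print(row["Title"])
```

Representing the query as nested lists keeps the sketch independent of any particular query-string grammar; a parser for the tool's real syntax would only need to produce an equivalent structure.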

Authors and Institution

The HTML-TM tool is developed by:

Roberto T. Raittz^1,2
Diogo de J. S. Machado^1,2

Affiliations: