Upload documents and apply state-of-the-art NLP methods — BERTopic with BERT embeddings, Latent Dirichlet Allocation, TF-IDF term weighting, and transformer-based sentiment analysis — in seconds.
This almost always means min_topic_size is too large relative to your corpus. HDBSCAN requires at least min_topic_size documents to form a cluster — if your corpus has 50 documents and min_topic_size is 20, only themes with 20+ documents will appear as topics; everything else becomes outlier noise (-1). Try reducing min_topic_size to 5–10 for small corpora. Also check that your ngrok/Colab URL is active — a disconnected backend will return an empty response silently.
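The arithmetic behind this can be sketched in a few lines of plain Python. This is a deliberate simplification — HDBSCAN clusters by density, not by a fixed size threshold — and the theme sizes below are invented for illustration, not TopicMiner output:

```python
def simulate_clustering(theme_sizes, min_topic_size):
    """Count how many themes survive the minimum-cluster-size rule
    and how many documents fall into the outlier bucket (-1)."""
    topics = [s for s in theme_sizes if s >= min_topic_size]
    outliers = sum(s for s in theme_sizes if s < min_topic_size)
    return len(topics), outliers

# A 50-document corpus split across four hypothetical themes.
themes = [22, 15, 8, 5]

print(simulate_clustering(themes, 20))  # (1, 28): only the 22-doc theme survives
print(simulate_clustering(themes, 5))   # (4, 0): all four themes become topics
```

With min_topic_size=20, 28 of 50 documents are discarded as outliers; dropping it to 5 recovers all four themes.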
BERTopic uses deep neural embeddings (BERT) to understand the meaning of words in context. It is more accurate, especially for complex or ambiguous language, but requires a GPU and significant computation time. It works best on corpora of 500+ documents.
LDA treats each document as a bag of words — word order and context are ignored. It is much faster, runs on any machine, and works well on corpora as small as 50 documents. For most research use cases, LDA is the practical starting point and BERTopic is used to validate or deepen the findings.
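The bag-of-words pipeline described above can be sketched with scikit-learn (this is a minimal illustration, not TopicMiner's exact configuration, and the four documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "teachers discuss classroom assessment methods",
    "assessment rubrics guide classroom grading",
    "neural networks learn word embeddings",
    "embeddings capture word meaning in vectors",
]

# Word order is discarded here: each document becomes a vector of term counts.
counts = CountVectorizer().fit_transform(docs)

# Fit a 2-topic model; random_state makes the run reproducible.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.components_.shape)  # (n_topics, vocabulary_size)
```

Because only counts matter, "assessment guides grading" and "grading guides assessment" produce identical vectors — which is exactly why LDA is fast, and why BERTopic's context-aware embeddings can separate meanings that LDA conflates.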
It depends on the method. LDA works from around 50 documents, though 200+ produces more stable results. BERTopic needs at least 100–200 documents and performs best above 500. TF-IDF and Sentiment Analysis work on any number of documents — even a single file. Word Cloud follows LDA's requirements since it runs LDA internally. As a general rule: more documents → more reliable topics.
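These rules of thumb can be collected into a small helper. The thresholds below restate the guidelines above — they are advisory minimums from this FAQ, not hard limits enforced by TopicMiner:

```python
# Advisory minimum corpus sizes per method (from the guidelines above).
MIN_DOCS = {
    "bertopic": 100,   # performs best above 500
    "lda": 50,         # 200+ gives more stable results
    "wordcloud": 50,   # runs LDA internally
    "tfidf": 1,
    "sentiment": 1,
}

def usable_methods(n_docs):
    """Return the methods whose advisory minimum corpus size is met."""
    return sorted(m for m, n in MIN_DOCS.items() if n_docs >= n)

print(usable_methods(60))  # everything except BERTopic
```

A 60-document corpus clears the bar for LDA, Word Cloud, TF-IDF, and Sentiment Analysis, but falls short of BERTopic's 100-document floor.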
This is the most common preprocessing problem. Domain-specific high-frequency words pass through the stopword filter because they are not in the standard English stopword list. The fix is to add these words to the Custom Stopwords field (comma-separated) and rerun. Also try enabling Lemmatization and lowering Max Doc Frequency to automatically exclude corpus-wide terms.
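Both fixes — custom stopwords and a lower max document frequency — can be sketched with scikit-learn's CountVectorizer (the domain words "interview" and "participant" are hypothetical examples, and this is an illustration of the concept, not TopicMiner's internal pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Domain-specific terms that the standard English stopword list misses.
custom_stopwords = {"interview", "participant"}
stop_words = list(ENGLISH_STOP_WORDS | custom_stopwords)

docs = [
    "the participant described the interview process",
    "each interview asked the participant about teaching",
    "teaching strategies varied across schools",
]

# max_df=0.8 additionally drops any term appearing in over 80% of documents.
vec = CountVectorizer(stop_words=stop_words, max_df=0.8)
vec.fit(docs)

print(sorted(vec.vocabulary_))  # "interview" and "participant" are gone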
Use the Coherence Score tool — it automatically tests a range of K values and identifies the one producing the most semantically coherent topics. As a starting heuristic, try K ≈ √(N/2) where N is the number of documents (e.g. 200 documents → start around K = 10). Always inspect the actual topic words after selecting K — a coherence-optimal K that produces uninterpretable topics is less useful than a slightly lower K with clearer themes.
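The √(N/2) starting heuristic is a one-liner; coherence scoring and manual inspection should still decide the final K:

```python
import math

def starting_k(n_docs, k_min=2):
    """Rule-of-thumb initial topic count: K ≈ sqrt(N / 2)."""
    return max(k_min, round(math.sqrt(n_docs / 2)))

print(starting_k(200))  # → 10
print(starting_k(50))   # → 5
```

Treat the result as the midpoint of the range you hand to the Coherence Score tool, not as a final answer.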
VADER was designed for social media and informal English. It struggles with sarcasm, formal academic writing, domain-specific language, and non-English text. If your corpus uses specialised vocabulary or implied sentiment, switch to RoBERTa (requires Colab). For non-English text, neither model will perform reliably.
TXT is fastest — plain text requires no parsing. XLSX is medium speed. PDF is slowest — the binary structure is parsed page by page, and scanned/image-based PDFs return empty text. Use the ⚡ to TXT button next to any uploaded PDF to convert it client-side before running analysis.
ngrok URLs are session-based — they expire when the Colab runtime is idle or restarted. Each time you rerun the notebook, a new URL is generated. Click ⚙ Advanced, paste the new URL from your Colab output, and click Test Connection. The Render backend for LDA/VADER is always-on but may take 30–60 seconds to wake from idle.
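For scripted workflows, a connectivity probe analogous to the Test Connection button takes only a few lines of standard-library Python. The URL handling here is generic — substitute your current ngrok URL, which is not shown because it changes every session:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def backend_is_up(url, timeout=5):
    """True if the URL answers an HTTP request (any status) within timeout."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        return True   # server responded, even if with an error status code
    except (URLError, ValueError, OSError):
        return False  # DNS failure, refused connection, timeout, or bad URL

# A guaranteed-unresolvable hostname (RFC 2606 reserved TLD) returns False.
print(backend_is_up("http://invalid.invalid", timeout=2))
```

An HTTP error status still counts as "up" here, because it proves the tunnel is alive — a stale ngrok URL typically fails at the DNS or connection stage instead.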
Your files are never retained. Documents you upload are sent to the analysis backend solely for processing; no file content is stored, logged, or kept after the analysis completes. Results exist only in your browser and are discarded when you close or refresh the page.
No personal data is collected. TopicMiner does not use cookies, does not track usage, and does not store any identifying information. There is no account system and no database. Each session is completely stateless.
Colab backend: When using BERTopic via Google Colab, your documents are processed on a Google-provided virtual machine that you control. The Colab session and all its data are deleted when the runtime is terminated.
TopicMiner is a free, open-access NLP research platform built for academics, students, and researchers who need to extract structure and meaning from text corpora without requiring programming knowledge. It brings together five state-of-the-art analysis methods — BERTopic, LDA, TF-IDF, Word Cloud, and Sentiment Analysis — in a single unified interface, each grounded in peer-reviewed research and implemented according to best practices from the computational linguistics literature.
Unlike commercial tools such as MonkeyLearn, IBM Watson, or Google Cloud NLP, TopicMiner is entirely free, requires no API keys or subscriptions, and is designed specifically for academic and research use cases rather than business analytics. Every result is explainable and traceable back to the underlying model parameters.
TopicMiner is designed for researchers and graduate students in linguistics, social sciences, education, communication, and any field that involves qualitative text analysis. It is particularly suited for thesis and dissertation work, where systematic, reproducible, and methodologically grounded text analysis is required but computational resources and programming expertise may be limited.
It is also useful for educators teaching NLP concepts, as the explainer panels provide academic citations and plain-language explanations of each method alongside the live analysis tool.
Charalampos Mentsios — developed TopicMiner as a research tool to support NLP-based text analysis in academic settings. The platform was built from the ground up using FastAPI for the backend, deployed on Google Colab (GPU) and Render, with a custom HTML/CSS/JS frontend designed for clarity and usability in research contexts.
Feedback, suggestions, and collaboration inquiries are welcome via LinkedIn.