TopicMiner
Topic Modeling · Sentiment Analysis · Text Mining · NLP
Open Research Tool

Uncover
Hidden Structure
in Text Corpora

Upload documents and apply state-of-the-art NLP methods — real BERTopic with BERT embeddings, Latent Dirichlet Allocation, TF-IDF term weighting, and transformer-based sentiment analysis — in seconds.

5 Analysis Methods
3 File Formats Supported
Documents Per Run

Upload Corpus

Multiple files supported
📚
Drop documents here
or click to browse — TXT, PDF, XLSX
TXT PDF XLSX

Analysis Method

Select one
🔮
BERTopic
BERT embeddings + HDBSCAN — requires Colab URL
🧩
LDA
Latent Dirichlet Allocation — fast topic modeling
📊
TF-IDF
Term frequency-inverse document frequency scoring
☁️
Word Cloud
Auto-generated after LDA — or run standalone
💬
Sentiment Analysis
VADER rule-based (fast) or RoBERTa transformer (accurate, Colab only)

Configuration

Number of Topics 10
Min Topic Size 5
💡 Min Topic Size Advisor
Suggestions for your corpus:
Language
Reduce outliers
Merge noise into nearest topic
🔬 Preprocessing
Custom Stopwords
Lemmatization
running→run (reduces noise)
Coherence Analysis
Optimal: — topics
Coherence Score (higher = better)
Perplexity (lower = better fit)
🎯
🔬
Results will appear here
Upload files · choose method · click Run Analysis
AI Results Evaluation · Claude · context-aware

How to Use TopicMiner

Getting Started
01
Connect Backend
Paste your Render URL for LDA, TF-IDF, and Sentiment. For real BERTopic with BERT embeddings, run the Colab notebook and paste the ngrok URL.
02
Upload Corpus
Upload one or more TXT, PDF, or XLSX files. Multiple files are treated as separate documents — ideal for comparing corpora or analyzing a collection.
03
Configure & Run
Select your analysis method, adjust preprocessing options (stopwords, lemmatization, n-grams), set the number of topics, and click Run Analysis.
04
Interpret Results
Read topic word distributions, TF-IDF rankings, sentiment breakdowns, or word clouds. Use Coherence Scores to find the statistically optimal number of topics.

References & Further Reading

Academic Sources
1
Latent Dirichlet Allocation
Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Journal of Machine Learning Research, 3, 993–1022.
2
BERTopic: Neural Topic Modeling with c-TF-IDF
Grootendorst, M. (2022). arXiv preprint arXiv:2203.05794.
3
BERT: Pre-training of Deep Bidirectional Transformers
Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). NAACL-HLT 2019, 4171–4186.
4
VADER: A Parsimonious Rule-based Model for Sentiment Analysis
Hutto, C.J., & Gilbert, E. (2014). ICWSM 2014, 8(1), 216–225.
5
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu, Y., Ott, M., Goyal, N., et al. (2019). arXiv preprint arXiv:1907.11692.
6
TweetEval: Unified Benchmark for Tweet Classification
Barbieri, F., Camacho-Collados, J., Espinosa-Anke, L., & Neves, L. (2020). EMNLP Findings 2020. [cardiffnlp/twitter-roberta-base-sentiment]
7
Exploring the Space of Topic Coherence Measures (C_V metric)
Röder, M., Both, A., & Hinneburg, A. (2015). WSDM 2015, 399–408.
8
Optimizing Semantic Coherence in Topic Models
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). EMNLP 2011, 262–272.
9
UMAP: Uniform Manifold Approximation and Projection
McInnes, L., Healy, J., & Melville, J. (2018). arXiv preprint arXiv:1802.03426. [Used in BERTopic dimensionality reduction]
10
HDBSCAN: Hierarchical Density-Based Clustering
McInnes, L., Healy, J., & Astels, S. (2017). Journal of Open Source Software, 2(11), 205. [Used in BERTopic clustering]
11
Text Mining Technologies Applied to Free-Text Answers in e-Assessment
Charitopoulos, A., Rangoussi, M., Metafas, D., & Koulouriotis, D. (2025). Discover Computing, 28, 5.
12
Uncovering the Structure of Amazon Product Reviews
Hou, Y., Li, J., He, Z., Yan, A., Chen, X., & McAuley, J. (2024). ACL 2024. [Amazon Movies & TV dataset]

Glossary of Key Terms

NLP Terminology
Corpus
A structured collection of text documents used as input for NLP analysis.
Topic
A probability distribution over vocabulary words that co-occur in similar contexts.
Stopwords
High-frequency function words (the, is, at) removed before analysis because they carry little discriminative meaning.
Lemmatization
Reducing a word to its dictionary base form using morphological analysis. "running" → "run".
Stemming
Mechanically stripping word suffixes to find a common root. Faster but less accurate than lemmatization.
TF-IDF
Term Frequency × Inverse Document Frequency. Rewards terms that are frequent in a document but rare across the corpus.
Embedding
A dense vector representation of text in a high-dimensional space where semantic similarity corresponds to geometric proximity.
Perplexity
A measure of how well a probability model predicts a sample. Lower perplexity = better model fit.
Coherence
A measure of topic interpretability — how semantically related the top words of a topic are to each other.
N-gram
A contiguous sequence of N words. Bigrams (N=2) like "machine learning" capture phrase-level meaning.
Dirichlet Prior
The α (alpha) and β (beta) hyperparameters in LDA: α controls the sparsity of per-document topic distributions, β the sparsity of per-topic word distributions.
Compound Score
VADER's normalized sentiment score ranging from -1 (most negative) to +1 (most positive).
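
TF-IDF in particular is easy to demystify in a few lines of code. Below is a minimal sketch in plain Python using the textbook formulation tf × log(N/df); production libraries such as scikit-learn add smoothing and normalization on top, so exact scores will differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: score(t, d) = tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: count * math.log(n_docs / df[t]) for t, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    ["topic", "model", "text"],
    ["topic", "model", "corpus"],
    ["sentiment", "text"],
]
scores = tfidf(docs)
# "corpus" occurs in only one of three documents, so in the second
# document it outscores "topic", which occurs in two.
```

A term occurring in every document gets log(N/N) = 0, which is exactly why corpus-wide words contribute nothing to discrimination.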

FAQ & Privacy

Common questions · Data notice

Frequently Asked Questions

Why do I get 0 topics or very few topics from BERTopic?

This almost always means min_topic_size is too large relative to your corpus. HDBSCAN requires at least min_topic_size documents to form a cluster — if your corpus has 50 documents and min_topic_size is 20, only themes with 20+ documents will appear as topics; everything else becomes outlier noise (-1). Try reducing min_topic_size to 5–10 for small corpora. Also check that your ngrok/Colab URL is active — a disconnected backend will return an empty response silently.
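
The cluster-size effect is easy to simulate without running HDBSCAN itself. A toy sketch with hypothetical cluster sizes (not real BERTopic output): any candidate cluster smaller than min_topic_size dissolves into the outlier topic (-1).

```python
def surviving_topics(cluster_sizes, min_topic_size):
    """Count clusters that survive the threshold and documents pushed to -1."""
    kept = [s for s in cluster_sizes if s >= min_topic_size]
    outliers = sum(s for s in cluster_sizes if s < min_topic_size)
    return len(kept), outliers

# A 50-document corpus whose candidate clusters have these sizes:
sizes = [22, 9, 8, 6, 5]
print(surviving_topics(sizes, min_topic_size=20))  # (1, 28): one topic, 28 outliers
print(surviving_topics(sizes, min_topic_size=5))   # (5, 0): all clusters survive
```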

What is the difference between BERTopic and LDA?

BERTopic uses deep neural embeddings (BERT) to understand the meaning of words in context. It is more accurate, especially for complex or ambiguous language, but requires a GPU and significant computation time. It works best on corpora of 500+ documents.

LDA treats each document as a bag of words — word order and context are ignored. It is much faster, runs on any machine, and works well on corpora as small as 50 documents. For most research use cases, LDA is the practical starting point and BERTopic is used to validate or deepen the findings.
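
The bag-of-words view is literal. A two-line Python illustration of how LDA sees a document: word order and context vanish, only counts remain.

```python
from collections import Counter

doc = "the model learns the topic the corpus reveals"
bag = Counter(doc.split())
# Any reordering of the same words produces an identical bag,
# so LDA treats the two documents as indistinguishable.
print(bag)  # Counter({'the': 3, 'model': 1, ...})
```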

How many documents do I need for good results?

It depends on the method. LDA works from around 50 documents, though 200+ produces more stable results. BERTopic needs at least 100–200 documents and performs best above 500. TF-IDF and Sentiment Analysis work on any number of documents — even a single file. Word Cloud follows LDA's requirements since it runs LDA internally. As a general rule: more documents → more reliable topics.

Why do my topics all contain the same words?

This is the most common preprocessing problem. Domain-specific high-frequency words pass through the stopword filter because they are not in the standard English stopword list. The fix is to add these words to the Custom Stopwords field (comma-separated) and rerun. Also try enabling Lemmatization and lowering Max Doc Frequency to automatically exclude corpus-wide terms.
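
The two fixes combine naturally in code. A plain-Python sketch of the idea behind custom stopwords plus a max-document-frequency cutoff (the same concept scikit-learn exposes as CountVectorizer's max_df; the helper name here is hypothetical, not the tool's actual code):

```python
from collections import Counter

def filter_corpus(docs, custom_stopwords=(), max_df=0.8):
    """Drop custom stopwords and any term present in > max_df of documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    banned = set(custom_stopwords) | {t for t, c in df.items() if c / n > max_df}
    return [[t for t in doc if t not in banned] for doc in docs]

docs = [["patient", "care", "study"],
        ["patient", "care", "trial"],
        ["patient", "outcome", "study"]]
# "patient" occurs in 3/3 documents (> 0.8) and is dropped automatically;
# "care" (2/3, below the cutoff) survives unless listed explicitly.
cleaned = filter_corpus(docs, custom_stopwords=("care",))
```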

How do I choose the right number of topics (K) for LDA?

Use the Coherence Score tool — it automatically tests a range of K values and identifies the one producing the most semantically coherent topics. As a starting heuristic, try K ≈ √(N/2) where N is the number of documents (e.g. 200 documents → start around K = 10). Always inspect the actual topic words after selecting K — a coherence-optimal K that produces uninterpretable topics is less useful than a slightly lower K with clearer themes.
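
The starting heuristic reduces to a single line; treat the result as a seed value and let the coherence curve make the final call.

```python
import math

def starting_k(n_docs):
    """Rule-of-thumb starting K for LDA: K ≈ sqrt(N / 2), floored at 2."""
    return max(2, round(math.sqrt(n_docs / 2)))

print(starting_k(200))  # 10, matching the example above
print(starting_k(50))   # 5
```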

Why does VADER give wrong sentiment scores on my texts?

VADER was designed for social media and informal English. It struggles with sarcasm, formal academic writing, domain-specific language, and non-English text. If your corpus uses specialised vocabulary or implied sentiment, switch to RoBERTa (requires Colab). For non-English text, neither model will perform reliably.

What file formats are supported and which is fastest?

TXT is fastest — plain text requires no parsing. XLSX is medium speed. PDF is slowest — the binary structure is parsed page by page, and scanned/image-based PDFs return empty text. Use the ⚡ to TXT button next to any uploaded PDF to convert it client-side before running analysis.

Why is the Colab backend showing as unreachable?

ngrok URLs are session-based — they expire when the Colab runtime is idle or restarted. Each time you rerun the notebook, a new URL is generated. Click ⚙ Advanced, paste the new URL from your Colab output, and click Test Connection. The Render backend for LDA/VADER is always-on but may take 30–60 seconds to wake from idle.

🔒 Privacy & Data Notice

Your files are never retained. Documents you upload are sent to the analysis backend solely for processing; no file content is stored, logged, or kept after the analysis completes. Results exist only in your browser and are discarded when you close or refresh the page.

No personal data is collected. TopicMiner does not use cookies, does not track usage, and does not store any identifying information. There is no account system and no database. Each session is completely stateless.

Colab backend: When using BERTopic via Google Colab, your documents are processed on a Google-provided virtual machine that you control. The Colab session and all its data are deleted when the runtime is terminated.

About TopicMiner

Research platform · Open tool
What is TopicMiner?

TopicMiner is a free, open-access NLP research platform built for academics, students, and researchers who need to extract structure and meaning from text corpora without requiring programming knowledge. It brings together five state-of-the-art analysis methods — BERTopic, LDA, TF-IDF, Word Cloud, and Sentiment Analysis — in a single unified interface, each grounded in peer-reviewed research and implemented according to best practices from the computational linguistics literature.

Unlike commercial tools such as MonkeyLearn, IBM Watson, or Google Cloud NLP, TopicMiner is entirely free, requires no API keys or subscriptions, and is designed specifically for academic and research use cases rather than business analytics. Every result is explainable and traceable back to the underlying model parameters.

Who is it for?

TopicMiner is designed for researchers and graduate students in linguistics, social sciences, education, communication, and any field that involves qualitative text analysis. It is particularly suited for thesis and dissertation work, where systematic, reproducible, and methodologically grounded text analysis is required but computational resources and programming expertise may be limited.

It is also useful for educators teaching NLP concepts, as the explainer panels provide academic citations and plain-language explanations of each method alongside the live analysis tool.

What makes it different?
  • Research-grade methods — BERTopic with real BERT embeddings via GPU, not simplified approximations
  • Academic citations — every method, parameter, and recommendation is grounded in peer-reviewed literature
  • No code required — full NLP pipeline accessible through a clean web interface
  • Transparent preprocessing — lemmatization, stopwords, n-grams, and all parameters are fully configurable and documented
  • Completely free — no account, no subscription, no data collection
  • Open architecture — built on FastAPI, scikit-learn, BERTopic, and Gensim; fully inspectable and extensible
Built by

Charalampos Mentsios — developed TopicMiner as a research tool to support NLP-based text analysis in academic settings. The platform was built from the ground up using FastAPI for the backend, deployed on Google Colab (GPU) and Render, with a custom HTML/CSS/JS frontend designed for clarity and usability in research contexts.

Feedback, suggestions, and collaboration inquiries are welcome via LinkedIn.

FastAPI BERTopic scikit-learn Gensim NLTK Render Google Colab PDF.js