Data Sources
Last updated: 29 April 2026
Scholise retrieves bibliographic metadata from established, openly accessible academic databases. Wherever one is available, each source shown in the application carries a verifiable identifier. We do not intentionally fabricate citations.
1. Academic database metadata
Scholise reads from large, open academic catalogues that index hundreds of millions of scholarly works across all disciplines.
Scholise retrieves:
- Title, authors, publication date, and journal/venue name
- Paper ID and open-access status
- Citation count and source type (journal article, book chapter, proceedings, etc.)
- Institutional affiliations of authors
- Abstract snippets (inverted index format, reconstructed for display)
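To illustrate the last item, here is a minimal sketch of how an abstract stored as an inverted index (each word mapped to the positions where it occurs) could be turned back into readable text. The function name and the sample data are hypothetical, and the exact shape of the index delivered by any given catalogue is an assumption.

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plain text from a word -> list-of-positions mapping."""
    # Flatten the mapping into (position, word) pairs.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Hypothetical sample: "open" occurs at positions 0 and 3.
sample = {"open": [0, 3], "catalogues": [1], "index": [2], "works": [4]}
print(reconstruct_abstract(sample))
```

Punctuation and capitalisation survive only if they were stored as part of each word, so a reconstructed snippet may differ slightly from the publisher's original layout.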
2. Academic summary metadata
Scholise also uses trusted academic metadata sources for summaries and citation context across science and medicine.
From academic summary metadata, Scholise retrieves:
- Title, authors, publication year, and abstract
- Citation count and paper ID (when available)
- TLDR — short AI-generated summaries displayed as "AI Summary" on result cards
Results are merged across academic databases by paper ID so the same paper does not appear multiple times.
3. Publisher record metadata
Scholise uses publisher record metadata to improve citation quality and source details.
From publisher record metadata, Scholise retrieves:
- Title, authors, publication date, and publisher name
- Paper ID, ISSN, and container (journal) title
- Citation count and document type
- Author affiliations when available
4. PubMed (NIH/NCBI)
For biomedical, clinical, nursing, and health-related queries, Scholise searches PubMed, the NIH's database of 35M+ biomedical and clinical papers. PubMed is included automatically when your research question relates to medicine, health, or life sciences — no separate account required.
From PubMed, Scholise retrieves:
- Title, authors, publication date, and journal name
- PubMed ID (PMID) and DOI when available
- Abstract text for top results
- Publication type (e.g. RCT, systematic review, meta-analysis)
5. arXiv (STEM preprints)
For STEM, computer science, physics, and mathematics queries, Scholise searches arXiv, an open-access repository of 2M+ preprints. arXiv results are always labelled as preprints in the app — they may not yet be peer-reviewed.
From arXiv, Scholise retrieves:
- Title, authors, publication date, and abstract
- arXiv ID and DOI when available
- Subject categories (e.g. cs.AI, q-bio)
6. Unpaywall
Unpaywall is a database of open-access versions of scholarly articles. For every paper in search results that has a paper ID, Scholise checks Unpaywall for a free legal full-text PDF link.
From Unpaywall, Scholise retrieves:
- Open-access status (is_oa)
- Direct PDF URL when a free legal full-text version exists
When a free PDF is available, we display a green "Free PDF" button on the result card. Results are cached for 30 days to reduce API load. Unpaywall does not require an API key; we identify ourselves with a contact email as requested by their API terms.
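The lookup and 30-day cache described above could look roughly like the sketch below. It assumes the paper ID is a DOI, which is what Unpaywall's public v2 endpoint is keyed on. The `is_oa`, `best_oa_location`, and `url_for_pdf` fields come from Unpaywall's response format; the class name, the injected `fetch` callable, and the in-memory dictionary are illustrative assumptions, not Scholise's actual implementation.

```python
import time
from typing import Callable, Optional
from urllib.parse import quote

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, as stated above


def unpaywall_url(doi: str, email: str) -> str:
    """Build the request URL; the contact email identifies the caller."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def free_pdf_url(response: dict) -> Optional[str]:
    """Return a direct PDF link only when the work is open access."""
    if not response.get("is_oa"):
        return None
    location = response.get("best_oa_location") or {}
    return location.get("url_for_pdf")


class OpenAccessCache:
    """Hypothetical in-memory cache keyed by lower-cased DOI."""

    def __init__(self, fetch: Callable[[str], dict],
                 now: Callable[[], float] = time.time):
        self._fetch = fetch  # assumed to perform the HTTP request
        self._now = now
        self._entries: dict[str, tuple[float, dict]] = {}

    def lookup(self, doi: str) -> dict:
        key = doi.lower()
        cached = self._entries.get(key)
        if cached is not None and self._now() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]  # still fresh: no API call
        result = self._fetch(doi)
        self._entries[key] = (self._now(), result)
        return result
```

Passing the fetch function and clock in as parameters keeps the sketch free of network calls and makes the expiry behaviour easy to check.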
7. Assisted discovery
In addition to direct database retrieval, Scholise uses assisted discovery workflows to surface additional scholarly sources across the open web. This may include results from:
- Google Scholar and PubMed
- ResearchGate, SSRN, and arXiv
- University institutional repositories
- Publisher websites and conference proceedings
How it works: your research question may be used to discover academic pages, from which structured bibliographic data (title, authors, paper ID, year, venue, abstract) is extracted. These results are normalised into the same format as database records and merged with de-duplication.
Assisted query expansion may generate more specific academic search phrasing, including discipline terminology and synonyms, to improve retrieval quality.
Important: assisted-discovery sources are extracted from web pages and may occasionally contain incomplete or slightly inaccurate metadata. Always verify these sources using their paper link or publisher page before citing them in academic work.
8. Workspace feature data use
Data from these sources powers multiple Scholise workspaces:
- Sources — search, filtering, ranking, and save/verify flows.
- Research Assistant — sourced responses, citation panels, follow-ups, and search transparency metadata.
- Evidence Table, Outline, Citation Checker, Counter-Evidence — analysis and writing support features that reference your saved source set.
- References / Exports — citation-ready output based on source metadata.
9. Citation Checker
The Citation Checker feature analyses your pasted text sentence by sentence, labelling each as Supported, Needs Citation, Over-claiming, Low Confidence, or Opinion. It uses Claude to classify your sentences and match them against your project's saved sources.
Data used: your pasted draft text and the titles of sources saved in your project are sent to the AI provider. The AI returns sentence-level labels and suggested source IDs. When displaying suggested sources, we also fetch citation count, reference count, and free PDF links from our cached paper metadata (drawn from the academic metadata providers described above) and from Unpaywall data. These are the same sources used for search results.
10. De-duplication
Because a single paper often appears in multiple providers, Scholise matches records on paper ID. When the same work is found in more than one source, we merge the records, keeping the most complete metadata from each source. The provenance badge in the app indicates the origin:
- Database A — metadata came exclusively from one academic database.
- Database B — metadata came exclusively from a publisher metadata source.
- S2 — metadata came from an academic summary source (TLDR summaries and citation counts may be included).
- Assisted — metadata was found by assisted web discovery only.
- Both — metadata was found in multiple sources and merged.
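The merge and badge logic described in this section can be sketched as follows. This is an illustration under stated assumptions: records are plain dictionaries with hypothetical `paper_id` and `provenance` keys, IDs are compared case-insensitively, and "most complete" is approximated by taking the first non-empty value found for each field. The real selection rules may differ.

```python
from collections import defaultdict

EMPTY_VALUES = (None, "", [])


def merge_records(records: list[dict]) -> list[dict]:
    """Merge records that share a paper ID; pass the rest through unchanged."""
    groups: dict[str, list[dict]] = defaultdict(list)
    merged: list[dict] = []

    for record in records:
        paper_id = record.get("paper_id")
        if paper_id:
            groups[paper_id.strip().lower()].append(record)
        else:
            merged.append(dict(record))  # no ID, so it cannot be matched

    for group in groups.values():
        combined: dict = {}
        for record in group:
            for field, value in record.items():
                if field == "provenance" or value in EMPTY_VALUES:
                    continue
                combined.setdefault(field, value)  # first non-empty value wins
        origins = {record["provenance"] for record in group}
        # One origin keeps its own badge; several origins become "Both".
        combined["provenance"] = origins.pop() if len(origins) == 1 else "Both"
        merged.append(combined)

    return merged


records = [
    {"paper_id": "10.1/X", "title": "T", "abstract": "", "provenance": "Database A"},
    {"paper_id": "10.1/x", "title": "T", "abstract": "Text", "provenance": "S2"},
    {"paper_id": None, "title": "U", "provenance": "Assisted"},
]
for merged_record in merge_records(records):
    print(merged_record["title"], merged_record["provenance"])
```

In this sample the two records sharing an ID collapse into one entry whose abstract comes from the second record, while the ID-less record is kept separately with its original badge.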
11. Paper ID requirement
Scholise strongly prioritises sources with a paper ID. Paper IDs provide a permanent, resolvable link to the publisher's landing page. Sources without a paper ID may still appear if they have sufficient metadata, but paper-ID-backed sources are ranked higher because their provenance is verifiable.
12. Open-access indicators
Where available, Scholise displays an open-access badge for sources that can be read without a subscription. This information comes from open-access datasets and may include links to free full-text versions hosted on publisher sites, institutional repositories, or preprint servers.
For papers with a paper ID, we also check Unpaywall for a free legal PDF. When available, a green "Free PDF" button appears on the result card, linking directly to the full-text document.
13. Peer-reviewed focus
Scholise uses a heuristic scoring system to prioritise sources that are likely peer-reviewed. Signals include:
- Document type (journal articles, conference proceedings, book chapters score higher)
- Publisher type (university presses and known academic publishers score higher)
- Presence of an ISSN (indicates a serial publication, often peer-reviewed)
- Institutional affiliations of authors
- Citation count (well-cited work is more likely to have undergone review)
This is a heuristic, not a guarantee. The scoring helps surface quality sources, but it cannot certify that any individual work has been peer-reviewed. If peer-review status is critical for your use case, verify it directly with the publisher or journal.
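A heuristic of this kind might be sketched as a weighted sum over the signals listed above. Every weight, field name, and document-type label below is a hypothetical placeholder chosen for illustration; the values Scholise actually uses are not stated here.

```python
import math

# Hypothetical labels for document types that tend to be peer-reviewed.
LIKELY_REVIEWED_TYPES = {"journal-article", "proceedings-article", "book-chapter"}


def peer_review_score(source: dict) -> float:
    """Return a score between 0 and 1; higher means more likely peer-reviewed."""
    score = 0.0
    if source.get("document_type") in LIKELY_REVIEWED_TYPES:
        score += 0.35
    if source.get("academic_publisher"):  # university press or known publisher
        score += 0.20
    if source.get("issn"):  # serial publication
        score += 0.15
    if source.get("affiliations"):  # authors list an institution
        score += 0.10
    # Citations contribute on a log scale, capped so that a few highly
    # cited works cannot dominate the other signals.
    citations = max(0, source.get("citation_count") or 0)
    score += 0.20 * min(1.0, math.log10(1 + citations) / 3)
    return round(score, 3)


article = {
    "document_type": "journal-article",
    "academic_publisher": True,
    "issn": "1234-5678",
    "affiliations": ["Example University"],
    "citation_count": 40,
}
preprint = {"document_type": "preprint", "citation_count": 40}
print(peer_review_score(article) > peer_review_score(preprint))
```

Because the result is a ranking signal and not a verdict, a low score should be read as "fewer indicators present", never as proof that a work was not reviewed.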
14. Source verification
Wherever they are available, sources in Scholise link to verifiable external records:
- Paper ID resolution — clicking a paper ID link resolves to the publisher's landing page via doi.org.
- Publisher information — publisher metadata sources provide publisher and member details for each registered paper ID.
- Database work page — each source has a corresponding academic database record with full bibliographic detail, citation graphs, and related works.
We encourage users to follow these links to verify any source before relying on it in their work.
15. Limitations
- Metadata quality depends on what publishers register and what academic databases index. Some fields (abstracts, affiliations) may be incomplete.
- Very recent publications may take days or weeks to appear in the databases Scholise queries.
- Non-English sources are indexed but may have less complete metadata.
- Grey literature, theses, and preprints without paper IDs have limited coverage.
- Assisted-discovery sources are extracted from web pages and may occasionally have incomplete metadata (e.g. missing abstracts or approximate citation counts).
16. Questions
If you notice incorrect metadata or have questions about our data sources, contact us at support@scholise.com.