Note that Claude Shannon's MS thesis was about rediscovering the work of an obscure British analytic philosopher whose work from about 100 years earlier had been almost completely forgotten. (Perhaps a few philosophers and mathematicians remembered Boole, but they certainly didn't teach his work to the engineers who had to design relay-based logic circuits back in the pre-transistor days.)
For those interested in delving further into LLM scientific discovery, there is a great GitHub repo collecting research papers on this very topic - https://github.com/HKUST-KnowComp/Awesome-LLM-Scientific-Dis...
Personally I'm a proponent of representing academic knowledge in knowledge graphs, and this site does just that - https://orkg.org/
I've just launched a site to find code repositories linked to academic papers and to summarise key paper attributes. In the future I intend to integrate a hypothesis generator - https://researchlit.com
Been there done that. At least for life science / health publications. The article is spot on.
Not sure whether that approach has value in other, more rigorous fields, but in health it certainly does. Knowledge in health science is generally fragmented, and a way to connect islands of knowledge has the potential to unlock a lot of value.
If you would like to see how this article's ideas are applied in a playful manner in a web application, you can visit: https://www.biovista.com/vizit/
>There are on the order of 100 million papers [reference 2] published to date.
Does anyone else feel as if this (admittedly rough) estimate is off by an order of magnitude?
OpenAlex has 240M. https://docs.openalex.org/api-entities/works
CORE has 431M. https://core.ac.uk/data
Crossref has 165M. https://www.crossref.org/blog/2025-public-data-file-now-avai...
These datasets are all biased towards work published in the digital age, but it's important to note that work is coming out much faster now than it used to.
So indeed, order 10^9 rather than 10^8, given that CORE's 431M is above sqrt(10)*10^8 (about 316M), the halfway point between 10^8 and 10^9 on a log scale.
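A minimal sketch of that rounding rule, using the three counts quoted above. The helper name `order_of_magnitude` is made up for illustration, and "nearest power of ten on a log scale" is an assumption about what "order" means here:

```python
import math

def order_of_magnitude(n: float) -> int:
    """Exponent of the nearest power of ten on a log scale.

    Values above sqrt(10) * 10^k round up to k + 1.
    """
    return round(math.log10(n))

# Counts as quoted in the thread (approximate).
counts = {"Crossref": 165e6, "OpenAlex": 240e6, "CORE": 431e6}
orders = {name: order_of_magnitude(n) for name, n in counts.items()}

for name, exponent in orders.items():
    print(f"{name}: ~10^{exponent}")
```

By this rule only CORE rounds up to 10^9; Crossref and OpenAlex still round to 10^8, so the conclusion depends on which dataset you trust.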
Is that because there is pressure to publish? I wouldn't say we have made advancements at a different rate during the last two decades than we did over the 20 years prior to that.
If 1% of the last 10 billion people to live were academics and published on average 5 papers (many only had one, i.e. their dissertation/thesis, but a small fraction will have had dozens or hundreds), that comes to 500 million.
I'm curious, do you think it's an order of magnitude too low or too high?
I think it's too low.
MEDLINE (health / life science) has 37M papers.
IIRC the rate of publishing was superlinear, so the cumulative number of publications grows faster than a quadratic function.
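A rough way to see this, assuming the yearly publication rate r(t) is at least linear, r(t) >= c t for some constant c > 0 (both symbols are introduced here for illustration, not taken from the article):

```latex
N(T) \;=\; \int_0^T r(t)\,dt \;\ge\; \int_0^T c\,t\,dt \;=\; \frac{c\,T^2}{2}
```

So a linear rate already gives a quadratic cumulative count N(T), and if r(t)/t keeps growing without bound, N(T)/T^2 does too.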
Excellent article! Definitely needs to be read a few times to get the gist.
In this context, folks might find an earlier methodology from the Soviet era named TRIZ highly relevant - https://en.wikipedia.org/wiki/TRIZ
TRIZ (/trɪz/; Russian: теория решения изобретательских задач, romanized: teoriya resheniya izobretatelskikh zadach, lit. 'theory of inventive problem solving') is a methodology which combines an organized, systematic method of problem-solving with analysis and forecasting techniques derived from the study of patterns of invention in global patent literature.
TRIZ developed from a foundation of research into hundreds of thousands of inventions in many fields to produce an approach which defines patterns in inventive solutions and the characteristics of the problems which these inventions have overcome.
References:
TRIZ 40 Principles examples for various Domains - https://web.archive.org/web/20111203105442/http://www.triz-j...
TRIZ and Software - 40 Principle Analogies, Part 1 - https://web.archive.org/web/20120130205515/http://www.triz-j...
TRIZ and Software - 40 Principle Analogies, Part 2 - https://web.archive.org/web/20120131003258/http://www.triz-j...