Sci-Hub is the most controversial project in today's science. The goal of Sci-Hub is to provide free and unrestricted access to all scientific knowledge ever published in journal or book form.

Today the circulation of knowledge in science is restricted by high prices. Many students and researchers cannot afford academic journals and books that are locked behind paywalls. Sci-Hub emerged in 2011 to tackle this problem. Since then, the website has revolutionized the way science is being done.

Sci-Hub is helping millions of students and researchers, medical professionals, journalists and curious people in all countries to unlock access to knowledge. The mission of Sci-Hub is to fight every obstacle that prevents open access to knowledge: be it legal, technical or otherwise.

To get more information, visit the About Sci-Hub section.

Database, Journal, & Article Searching

Search by PMID or DOI

Did you know....

If you find an article that has a PMID or a DOI and aren't sure if we have it, you can use the Citation Linker or Libkey.io to search the library resources. If the library doesn't have it, you will be directed to Interlibrary Loan so you can request the article.

  • Libkey.io Update 2022: Libkey has partnered with Retraction Watch to indicate retracted articles. Instant access to millions of articles provided by your library. Search by DOI or PMID.
  • Citation linker If you already know specific citation information, such as the DOI, or PMID, or the title, author, and journal, you can enter that information in this citation linker.

  • Last Updated: Jul 3, 2024 3:06 PM
  • URL: https://molloy.libguides.com/searching

“The only truly modern academic research engine”

Oa.mg is a search engine for academic papers, specialising in open access. We have over 250 million papers in our index.

Find scientific papers by searching here or download the Chrome extension

Unlocking Knowledge: Your Gateway to Open Access Scientific Papers and Research Data

Introduction

In the digital era, the quest for knowledge and scientific discovery is no longer confined to the walls of academia and research institutions. Welcome to [Your Website Name] , a dedicated platform for finding and downloading open access scientific papers and other research data. Our mission is to democratize access to scientific information, making it freely available to researchers, students, and curious minds across the globe.

What is Open Access?

Open Access (OA) refers to the practice of providing unrestricted access via the Internet to peer-reviewed scholarly research. OA content is available to all, without the usual financial or legal barriers. We believe that open access is crucial in fostering a culture of knowledge sharing and collaboration, thereby accelerating innovation and discovery.

Types of Open Access:

  • Gold Open Access: Papers are published in open access journals that provide immediate open access to all of their articles.
  • Green Open Access (Self-Archiving): Authors publish in any journal and then self-archive a version of the article for free public use in their institutional repository or on a website.
  • Hybrid Open Access: Some articles in a subscription journal are made open access upon the payment of an additional charge.

Downloading Resources

  • Direct Downloads: Once you find a paper or dataset, download it directly.
  • Citation Tools: Easily export citations in various formats to incorporate them into your research.

Open Access

Open access in scientific publishing represents a transformative approach that breaks down traditional barriers to knowledge dissemination. It is a movement dedicated to making scientific research freely available to all, fostering a more inclusive and collaborative scientific community. At its core, open access allows for the unrestricted sharing of research findings, enabling scientists, academicians, and the general public to access and utilize scientific papers without the constraints of subscription fees or licensing restrictions. This paradigm shift in scholarly communication is driven by the belief that knowledge, particularly that which is publicly funded, should be a communal resource, accessible to everyone for the greater good of society.

In the realm of scientific research, open access has numerous advantages. It accelerates the pace of discovery by allowing researchers to build upon existing work without delay, facilitating interdisciplinary collaboration and cross-pollination of ideas across various fields. This is particularly crucial in addressing global challenges, where rapid and unencumbered access to research can lead to faster solutions. Furthermore, open access democratizes knowledge by making it available to researchers in developing countries who may not have the resources for expensive journal subscriptions, thereby narrowing the research gap between high and low-income countries.

The open access model also aligns with the digital age's ethos of openness and transparency. It enables a more efficient validation and critique process, as a larger audience can scrutinize and contribute to the research. This can lead to higher quality and more reliable scientific work. Moreover, it provides an equal platform for emerging researchers and institutions to share their findings, ensuring that the visibility and impact of research are not confined to those within well-funded, prestigious entities.

However, the transition to open access is not without challenges. The sustainability of publishing models, quality assurance, and equitable distribution of costs are ongoing concerns. Despite these hurdles, the open access movement is gaining momentum, driven by the global scientific community's commitment to an open, accessible, and collaborative future in research. As we move forward, open access stands as a beacon of progress, symbolizing a world where knowledge is a shared and freely accessible asset, driving innovation and societal advancement.

How to use Sci-hub to get academic papers for free

  • Post author: Emil O. W. Kirkegaard
  • Post published: 10. April 2018
  • Post category: Science

I regularly tell people on Twitter to use Sci-hub when they say they can’t access papers:

Yes you can, use Sci hub like normal people do. — Emil O W Kirkegaard (@KirkegaardEmil) April 10, 2018

However, it seems that people don’t really know how to use Sci-hub. So here is a simple, visual guide.

1. Go to the Sci-hub website

The website’s URL may change because the lobbyists of Big Publish (Elsevier, SAGE etc.) constantly try to get governments to censor the website, as it cuts into their rent-seeking profits. You can find the latest URLs via this handy website called Where is Sci-Hub now? (alternatively, via Wikipedia). Currently, some working URLs are:

  • https://sci.hubg.org/
  • https://sci-hub.yncjkj.com (global)
  • https://sci-hub.mksa.top/ (global)
  • https://sci-hub.it.nf/
  • https://sci-hub.st/ ( São Tomé and Príncipe )
  • https://sci-hub.do (Dominican Republic)
  • https://sci-hub.se/ (Sweden)
  • https://sci-hub.shop (global) [redirects]
  • https://scihub.bban.top (global) [redirects]
  • https://scihub.wikicn.top/ (global) [redirects]
  • https://sci-hub.pl/ (Poland)
  • https://sci-hub.tw (Taiwan)
  • https://sci-hub.si (Slovenia)
  • https://mg.scihub.ltd/ (global)

If your country blocks the website, use one of the many free general purpose proxies. I tested hide.me for the purpose of writing this article and it works fine for Sci-hub using the Netherlands exit.

2. Go to the journal publisher’s website

Go to the website of whatever article it is you are trying to get. Here we pretend you want the article in my tweet above:

  • Seeber, M., Cattaneo, M., Meoli, M., & Malighetti, P. (2017). Self-citations as strategic response to the use of metrics for career decisions . Research Policy.

The website for this is sciencedirect.com, which is Elsevier’s cover name. Then, you locate either the URL for this (i.e. https://www.sciencedirect.com/science/article/pii/S004873331730210X) or the article’s DOI. The DOI is the unique document identifier that begins with “10.”. It is almost always shown somewhere on the site, so you can search the page for “10.” to find it. In rare cases, it is only in the page’s source code or may not exist. If it doesn’t exist, it usually means you can’t get the article thru Sci-hub. When you have the article’s URL/DOI, you simply paste it into the Sci-hub search box.
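
The “search for 10.” trick can also be done programmatically. Below is a minimal Python sketch, assuming you already have the page text in a string. The pattern is a commonly used approximation of what DOIs look like, not an official grammar, and `find_dois` is a hypothetical helper name chosen for this illustration.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a heuristic, not an official grammar; unusual DOIs may be missed.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(page_text: str) -> list:
    """Return the unique DOI-like strings in page_text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(page_text):
        # Strip punctuation that usually belongs to the surrounding sentence.
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


sample = (
    "Cite as: Science 196(4287), doi:10.1126/science.196.4287.293. "
    "See also https://doi.org/10.1126/science.196.4287.293"
)
print(find_dois(sample))
```

If the function returns nothing for the visible text, it may be worth running it on the page’s HTML source as well, since the DOI is sometimes only present there.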

Then you click “open” and you should get the article.

In some cases, this may not work. The APA journals seem to cause issues using the URL approach, so use the DOI approach. Sometimes Sci-hub returns the wrong article (<1% I should guess).

Finding articles from APA journals

These journals refuse to give a DOI, and they usually don’t work with the URL either (for example, this paper). A workaround is to search Crossref for the title, which gives the DOI, then use the DOI to fetch the paper as usual.
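
The Crossref lookup can also be done through Crossref’s public REST API rather than its website. The sketch below is an illustration under stated assumptions: the helper names (`crossref_query_url`, `candidate_dois`, `lookup`) are made up for this example, and it relies on the `query.bibliographic` parameter of the `/works` endpoint and on the `message.items[].DOI` and `title` fields of the JSON response.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(title: str, rows: int = 3) -> str:
    """Build a Crossref /works search URL for a bibliographic (title) query."""
    params = {"query.bibliographic": title, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def candidate_dois(response: dict) -> list:
    """Extract (DOI, title) pairs from a decoded Crossref /works response."""
    items = response.get("message", {}).get("items", [])
    # Crossref returns each title as a list of strings; take the first one.
    return [
        (item["DOI"], (item.get("title") or [""])[0])
        for item in items
        if "DOI" in item
    ]


def lookup(title: str) -> list:
    """Query Crossref over the network and return candidate (DOI, title) pairs."""
    with urllib.request.urlopen(crossref_query_url(title), timeout=30) as resp:
        return candidate_dois(json.load(resp))
```

Calling `lookup("Self-citations as strategic response to the use of metrics for career decisions")` should return a few candidates. The search is fuzzy, so compare the returned title with the one you want before trusting the DOI.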

The Open Access Button is now built by OA.Works . Same people, new name! Read more about our rebrand.

The Open Data Button has now merged with the Open Access Button. Your account and request will stay the same, but you'll need to get the new plugin. For more on the changes see our blog .

Thanks for your support! Team Button has now merged with the Open Access Button and our Request system.

Avoid Paywalls, Request Research.

Free, legal research articles delivered instantly or automatically requested from authors.

Get around this paywall in a flash:

  • DOI: 10.1126/science.196.4287.293
  • URL: http://science.sciencemag.org/content/196/4287/293/tab-pdf
  • PMC (Pubmed Central) ID: PMC4167664
  • Pubmed ID: 17756097
  • Title: Ribulose bisphosphate carboxylase: a two-layered, square-shaped molecule of symmetry 422
  • Citation: Baker, T. S., Eisenberg, D., & Eiserling, F. (1977). Ribulose Bisphosphate Carboxylase: A Two-Layered, Square-Shaped Molecule of Symmetry 422. Science, 196(4287), 293-295. doi:10.1126/science.196.4287.293

or try your favourite citation format (Harvard, Bibtex, etc).
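
A search box that accepts all of these inputs has to work out which kind of identifier it was given. The Python sketch below shows one simple way this could be done with regular expressions. It is not the Open Access Button’s actual logic; `classify_identifier` and its return labels are hypothetical, and the patterns are heuristics.

```python
import re


def classify_identifier(text: str) -> str:
    """Guess which kind of article identifier a string is (heuristic only)."""
    s = text.strip()
    if re.match(r"^https?://", s, re.IGNORECASE):
        return "url"
    # DOIs start with "10.", a registrant code, and a slash; allow a "doi:" prefix.
    if re.match(r"^(doi:\s*)?10\.\d{4,9}/\S+$", s, re.IGNORECASE):
        return "doi"
    # PubMed Central IDs are "PMC" followed by digits.
    if re.match(r"^PMC\d+$", s, re.IGNORECASE):
        return "pmcid"
    # A bare run of digits is treated as a PubMed ID.
    if re.match(r"^\d+$", s):
        return "pmid"
    # Anything else is assumed to be a title or a full citation.
    return "title-or-citation"


for example in ["10.1126/science.196.4287.293", "PMC4167664", "17756097"]:
    print(example, "->", classify_identifier(example))
```

The order of the checks matters: a doi.org link is reported as a URL here, so a real tool would probably extract the DOI from such links as an extra step.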

Check out some of our latest requests .

Finding Available Research

Give us a scholarly paper and we’ll search thousands of sources with millions of articles to link you to free, legal, full text articles instantly.

Requesting Research

If we can’t get you access, we’ll start a request for you. We request articles from authors, and guide them on making the work available to you and everyone who needs it.

You can do this from our website, browser extensions, tools for libraries or our API . Take your pick or learn more.

Proudly non-profit • Open source • Library-aligned

Built by OA.Works

Is there a simple way to bulk download a large number of papers from a list of references

I've got a library of 1200 references I'm using for a systematic review. Now I need to download the PDFs of all these references, which will take days if I do it manually. Is there a simple way to automatically download as many as possible from pubmed / Google Scholar / (maybe Scihub)? I have institutional access.

Edit: My solution was to load the reference list into multiple reference managers and run the PDF import function in all of them. Some managers succeeded where others failed. I had to do the rest manually.

  • reference-managers

Asked by Nereus

  • Did you try to ask your librarian? –  EarlGrey Commented May 24, 2023 at 11:56
  • If you want to do a systematic review of these 1200 works I suspect you would need to read at the very least the abstract but presumably significant parts of the text in each of them. Relative to that effort, the downloading is completely trivial. –  quarague Commented May 24, 2023 at 12:43

2 Answers

For systematic review, if you're following PRISMA, you'll typically do some preliminary 'checks' before getting to the lists for full text review.

I'm assuming the 1200 odd references are your final list after the duplicate removal and screening, and perhaps your forward-and-reverse literature chaining.

In Zotero, you can enable automatic PDF download in Preferences. For your purpose, you can then bulk-import the references using DOI, BibTeX or RIS.

  • Someone recently developed a working script for bulk adding doi and updating metadata

The trick with Zotero is that it can find a PDF link for an entry from multiple sources. There are limitations, though. Beware that most academic databases will lock you out if you perform large bulk downloads at a fast rate. It is always a good idea to use a proxy. In your case, you already have institutional access, which might help.

With Endnote, you can bulk import PDF files or, as in Mendeley, set a 'watch' on a folder from which Endnote will automatically import entries for PDF files added to it. Unfortunately, that does not address your challenge. With Endnote, as with Zotero, you can import a reference list to populate your Endnote database.

  • simply export the reference list from your search (Google, Scopus ...) to a RIS file.
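
Once you have a RIS export, pulling out the DOIs gives you a clean list to feed into a reference manager or a downloader. Here is a minimal Python sketch, assuming the exporter writes DOIs in the standard `DO` tag (some tools put them in other fields, which this would miss). `dois_from_ris` is a hypothetical helper name and the sample record is made up for illustration.

```python
import re

# RIS lines look like "DO  - 10.xxxx/yyyy": a two-letter tag, spaces, a hyphen, the value.
RIS_DOI_LINE = re.compile(r"^DO[ \t]+-[ \t]*(\S+)", re.MULTILINE)


def dois_from_ris(ris_text: str) -> list:
    """Return the unique DOIs found in the DO fields of a RIS export."""
    dois = []
    for raw in RIS_DOI_LINE.findall(ris_text):
        # Some exporters store the DOI as a full doi.org link; keep only the DOI.
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", raw, flags=re.IGNORECASE)
        if doi not in dois:
            dois.append(doi)
    return dois


sample_ris = """TY  - JOUR
TI  - Ribulose Bisphosphate Carboxylase
JO  - Science
PY  - 1977
DO  - 10.1126/science.196.4287.293
ER  -
"""
print(dois_from_ris(sample_ris))
```

Records that come back without a DOI are the ones you will most likely have to chase manually.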

[Technical approach beyond the scope of Academia forum] For other technical solutions, you can work directly with the APIs of academic databases.

  • ScienceDirect and Scopus provide API access, which you can register for free of charge. You'll still need access to perform low-level tasks and download.
  • you can leverage Python to work with academic database APIs
  • for Google Scholar, use Scholarly (available on PyPI and GitHub)
  • for Scopus, pybliometrics is well used.
  • Pyscopus claims to be more friendly. I have yet to use Pyscopus, unlike the others. What's more, it has been inactive since 2019!

There's one I've used recently, just can't recall the name offhand. It allows robust analysis and topic search. I'll update in due course.

[Scientific PDF download]

  • RESP: Research Papers Search claims to search and download scientific papers. Yet to try it out.
  • Articledownloader is worth exploring
  • PyPaperBot is well used for downloading scientific articles from DOI or academic database.

I'm busy with a fork of Automated Search Helper, a research project by Lech Madeyski's team at Wroclaw University of Science and Technology, Poland. I have yet to upload the latest revision, which has the

  • pdf downloader working with JSGlue, Jinja2
  • I have it working locally but need some code clean-up and documentation.

NB: with automatic downloaders, beware of CAPTCHAs and blocking/bans by academic databases
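
One simple way to reduce the risk of being blocked is to space requests out. The Python sketch below is an illustration, not a guarantee against bans: `throttled_map` is a hypothetical helper, `fetch` stands for whatever download function you use, and the 5-second default is an arbitrary assumption that should be adjusted to the provider's terms of use.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def throttled_map(
    fetch: Callable[[T], R],
    items: Iterable[T],
    min_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[R]:
    """Call fetch(item) for each item, starting calls at least min_interval seconds apart."""
    last_start: Optional[float] = None
    for item in items:
        if last_start is not None:
            # Wait only for the part of the interval that has not already elapsed.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch(item)
```

Usage would look like `for pdf in throttled_map(download_pdf, dois): ...`, where `download_pdf` is your own function. The `sleep` and `clock` parameters exist only so the pacing can be checked without real waiting.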

SciPDFParser comes across as a good parser for downloaded article PDFs.

Answered by semmyk-research

I don't know of any tool to specifically scrape academic databases for pdf's. There may be some obscure program out there on GitHub or a web-crawler that could be repurposed. This question is a bit dated but addresses your problem in more detail and proposes some interesting solutions in that vein.

The easiest off the shelf solution is Endnote. It has a feature that allows for automatic search and retrieval of pdf's. If you have access, it works fairly well. Though it doesn't capture everything . I suspect that there are other reference managers with a similar feature. I don't know of free ones specifically, if that is a concern.

If none of those options are workable for you, consider if it is necessary to download all those pdf's. I'm assuming that you are just beginning to conduct your screening and so you likely don't need the full texts right away. I have conducted a handful of systematic reviews and I have always relied on title and abstract for the initial screen. If I could not make a decision from that info, simply navigating to the original online version was sufficient. Since you already have institutional access, why store them on your local device from the get-go? It would be significantly easier to download and store the papers you flag for further review or inclusion. This may not be right in your case, but it's something to consider if all else fails.

Answered by sErISaNo

Academia Insider

Best Websites To Download Research Papers For Free: Beyond Sci-Hub

Navigating the vast ocean of academic research can be daunting, especially when you’re on a quest for specific research papers without the constraints of paywalls. Fortunately, the digital age has ushered in an era of accessible knowledge, with various platforms offering free downloads of scholarly articles.

In this article, we explore some of the best websites that provide researchers, students, and academicians with free access to a plethora of research papers across diverse fields, ensuring that knowledge remains within everyone’s reach.

Best Websites To Download Research Papers For Free

Platform: Features

  • Google Scholar: Hosts diverse academic papers. Free access to many scholarly articles. Links to open-access resources.
  • ResearchGate: Combines social networking with research. Direct downloads of open-access papers. Allows requests for papers from authors.
  • ScienceOpen: Open-access article repository. Direct download of free PDFs. Search using keywords, DOI, or journals.
  • Directory of Open Access Journals (DOAJ): Extensive open-access journal repository. Free download of scholarly articles. Advanced search by keywords, publisher, language.
  • PubMed: Focus on medicine and life sciences. Lists open-access and subscription articles. Free full-text links and integration with Unpaywall.
  • Sci-Hub: Free access to paywalled articles. Uses DOI for article retrieval. Legal and ethical considerations.

Google Scholar

As a researcher, you might find Google Scholar to be a repository brimming with academic papers covering a broad span of domains like social sciences, computer science, and humanities, including:

  • Journal articles, and
  • Conference papers.

Unlike other websites to download research papers, Google Scholar provides free access to a vast collection of scholarly literature, making it one of the best websites to download research.

Not every article is available in full PDF format directly; however, Google Scholar often links to other open access resources like DOAJ (Directory of Open Access Journals) and open-access repositories where you can directly download papers.

For instance, if you’re searching for a specific 2023 research paper in mathematics, you can use Google Scholar to locate the paper and check if it’s available for free download either on the platform itself or through links to various open access sources.

Google Scholar also pairs well with tools like Unpaywall and Open Access Button, which are browser extensions that help you find free versions of paywalled articles.

These extensions point you to legally hosted open-access copies. Platforms like Sci-Hub and Library Genesis are a separate matter, and it’s crucial to be aware of the legal and ethical implications of using such services.

ResearchGate

ResearchGate is a unique platform that blends social networking with academic research, making it an essential tool for researchers and scientists across various disciplines.

Here, you have access to a digital library of millions of research papers, spanning fields from computer science to social sciences and beyond.

When you’re on ResearchGate, downloading a research paper is relatively straightforward, especially if it’s open access. Many researchers upload the full PDF of their work, providing free access to their peer-reviewed articles.

If the research paper you’re interested in isn’t available for direct download, ResearchGate offers a unique feature: you can request a copy directly from the author.

This approach not only gets you the paper but also potentially opens a line of communication with leading experts in your field.

It’s important to note that ResearchGate isn’t just a repository; it’s a platform for discovery and connection. You can:

  • Follow specific researchers
  • Join discussions, and
  • Receive notifications about new research in your domain.

While it doesn’t have the controversial direct download links like Sci-Hub or Library Genesis, ResearchGate offers a more ethical and legal route to accessing academic papers. 

ScienceOpen

ScienceOpen is a comprehensive repository that hosts a multitude of open-access research articles across various fields, from the social sciences to computer science. 

The process of downloading a research paper on ScienceOpen is remarkably straightforward. Since it’s an open-access platform, most of the papers are available to download as PDFs without any cost.

This means you can access high-quality, peer-reviewed academic research without encountering paywalls that are often a barrier in many other scientific platforms.

For instance, if you’re delving into the latest 2023 scientific papers in mathematics, ScienceOpen can be your go-to source. You can easily search for research papers by keyword or DOI, or by browsing through the various open access journals featured on the site.

The direct download feature simplifies access to these papers, making it convenient for you to obtain the research you need.

Directory of Open Access Journals (DOAJ)

The Directory of Open Access Journals (DOAJ) is an extensive repository of open-access, peer-reviewed journals, covering a wide array of subjects from humanities to nuclear science.

When you’re navigating DOAJ, you’ll discover that it’s not just a platform to download research papers; it’s a gateway to a world of academic research.

Each journal article listed is freely accessible, meaning you can download these scholarly articles without any cost or subscription.

The process is simple: search for research papers using specific keywords, subjects, or even DOAJ’s advanced search functionality that includes filters like:

  • Language, or
  • The year of publication.

For example, if you’re delving into the latest developments in scientific research in 2023, DOAJ allows you to refine your search to the most recent publications.

Once you find a relevant research paper, you can easily access the full text in PDF format through a direct download link. This is particularly useful for accessing high-quality, open-access research papers that are not always readily available on other platforms like Sci-Hub or Library Genesis.

PubMed

PubMed hosts millions of research articles, primarily in the fields of medicine and life sciences, but also encompassing a broad range of scientific research.

When you’re on PubMed, you can search for research papers using:

  • Authors, or
  • Specific journal names.
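
The same kind of author or journal search can be run against PubMed through NCBI's E-utilities `esearch` endpoint. The Python sketch below only builds the request URL and reads PMIDs out of a decoded JSON response. The helper names are hypothetical, and it assumes the `[Author]` and `[Journal]` field tags and the `esearchresult.idlist` field of the JSON output.

```python
import urllib.parse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(author: str = "", journal: str = "", keywords: str = "", retmax: int = 20) -> str:
    """Build an esearch URL that combines keywords, author and journal with AND."""
    parts = []
    if keywords:
        parts.append(keywords)
    if author:
        parts.append(f"{author}[Author]")
    if journal:
        parts.append(f'"{journal}"[Journal]')
    if not parts:
        raise ValueError("give at least one of author, journal or keywords")
    query = {"db": "pubmed", "term": " AND ".join(parts), "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urllib.parse.urlencode(query)


def pmids_from_response(response: dict) -> list:
    """Return the list of PMIDs from a decoded esearch JSON response."""
    return list(response.get("esearchresult", {}).get("idlist", []))


print(pubmed_search_url(author="Eisenberg D", journal="Science", retmax=5))
```

Fetching the printed URL returns JSON whose PMIDs can then be looked up one by one. NCBI asks heavy users to limit their request rate, so it is worth checking their usage guidelines before running this in a loop.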

While PubMed lists both open-access and subscription-based journal articles, it offers a unique feature for accessing papers for free.

If you’re looking for a particular research paper, say in the domain of computer science or social sciences from 2023, you can directly access its abstract on PubMed. For open access articles, a free full-text link is often available, allowing you to download the research paper in PDF format.

PubMed integrates with tools like Unpaywall and the Open Access Button. These browser extensions help you find open-access versions of the articles you’re interested in, bypassing the paywalls that often restrict access to scholarly literature.

While PubMed itself doesn’t provide direct download links for all articles, its connection with these tools and various open access repositories ensures that you, as a researcher, have greater access to scientific papers.
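
If you already have a DOI, you can also ask Unpaywall directly through its REST API instead of going through a browser extension. This Python sketch is an illustration under stated assumptions: the helper names are hypothetical, the email address is a placeholder you must replace with your own (the API requires one), and it relies on the `is_oa` and `best_oa_location` fields of the response.

```python
import urllib.parse
from typing import Optional

UNPAYWALL = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL for a DOI; an email parameter is required."""
    return UNPAYWALL + urllib.parse.quote(doi, safe="/") + "?" + urllib.parse.urlencode({"email": email})


def best_free_link(record: dict) -> Optional[str]:
    """Return the best open-access link from a decoded Unpaywall record, or None."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link; fall back to the general open-access URL.
    return location.get("url_for_pdf") or location.get("url")


print(unpaywall_url("10.1126/science.196.4287.293", "you@example.org"))
```

Fetching that URL and passing the decoded JSON to `best_free_link` gives either a link to a free copy or `None`, in which case requesting the paper from the author or your library is the next step.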

Sci-Hub (with Caution)

Sci-Hub, often dubbed the ‘Pirate Bay of Science,’ has been a game-changer in the scientific community since its inception by Alexandra Elbakyan in 2011.

It operates as a controversial, yet widely used platform providing free access to millions of research papers and academic articles that are typically locked behind paywalls.

As a researcher, you might find Sci-Hub an intriguing, albeit contentious, tool for accessing scholarly literature.

When you’re looking to download a research paper from Sci-Hub, the process is relatively straightforward. Say you need a journal article on computer science or a groundbreaking study in social sciences from 2023; you just need to have the DOI (Digital Object Identifier) of the paper.

By entering this DOI into Sci-Hub’s search bar, the website bypasses publisher paywalls, offering you direct download links to PDF versions of the articles.

It’s crucial to note that while Sci-Hub provides access to a vast repository of scientific research, its legality is under constant scrutiny. The platform operates via various proxy links and has been the subject of numerous legal battles with publishers and academic institutions.

Nevertheless, Sci-Hub remains a popular go-to for researchers and scientists globally, especially those without access to university libraries or digital archives.

While it opens doors to a wealth of knowledge, users should be aware of the ethical and legal implications of using such a service in their respective countries.

Wrapping Up: You Can Get Free Academic Papers 

The digital landscape offers a wealth of resources for accessing academic research without financial barriers. The platforms we share here provide an invaluable service to the scholarly community, democratising access to knowledge and fostering intellectual growth.

Whether you’re a seasoned researcher or a curious student, these websites bridge the gap between you and the vast world of academic literature, ensuring that the pursuit of knowledge remains an inclusive and equitable journey for all. Remember to consider the legal and ethical aspects when using these resources.

Dr Andrew Stapleton has a Masters and PhD in Chemistry from the UK and Australia. He has many years of research experience and has worked as a Postdoctoral Fellow and Associate at a number of Universities. Despite having secured funding for his own research, he left academia to help others with his YouTube channel all about the inner workings of academia and how to make it work for you.

2024 © Academia Insider

How to use Sci-Hub

Sci-Hub is a website built for downloading PDFs of journal articles and papers for free.

You want to read a scientific article or research paper, but it’s not available online, and your university or school doesn’t have a (very expensive) subscription to that particular journal. What are you supposed to do? Just not read it? That seems a bit unfair. Yes, some papers are available for purchase online, but did you know that the actual author of the paper doesn’t get any of that money? Or the peer-reviewers? I don’t mean to diminish the work that journals do, but if I have to decide between letting people read papers, or some publisher making even more money, I will always root for the person who wants to further their knowledge and studies. Downloading a PDF hurts no-one.

Step 01: Find the DOI

First of all, you need to find the DOI of the paper you are searching for.

What is a DOI? It’s numbers and letters that identify the article you’re searching for. Think of it a bit like a phone number. The way that dialling your phone number will only make your phone ring, searching for a DOI will give you exactly the article you are looking for.

You can find the DOI by searching for the paper's title on Google, clicking one of the links, and looking for the DOI somewhere on the resulting page. For example, let’s say the paper we want is “Mapping the Margins” by Kimberlé Crenshaw. We type the title into Google.

Click on the first link, which should take you to a page with more information about that particular paper, and copy the DOI.

Then go to Sci-Hub and paste it into the search field.

That’s all! It takes us directly to the PDF. To download the PDF, simply click the download button in the PDF preview or click the left button with the down arrow.

… to remove all barriers in the way of science

Sci-Hub is the first pirate website in the world to provide mass, public access to tens of millions of research papers.

A research paper is a special publication written by scientists to be read by other researchers. Papers are primary sources necessary for research – for example, they contain detailed description of new results and experiments.

At this time the widest possible distribution of research papers, as well as of other scientific or educational sources, is artificially restricted by copyright laws. Such laws effectively slow down the development of science in human society. The Sci-Hub project, running since 5 September 2011, is challenging the status quo. At the moment, Sci-Hub provides access to hundreds of thousands of research papers every day, effectively bypassing any paywalls and restrictions.

Sci-Hub Ideas

Knowledge to all.

We fight inequality in knowledge access across the world. Scientific knowledge should be available to every person regardless of their income, social status, or geographical location.

Our mission is to remove any barrier that impedes the widest possible distribution of knowledge in human society!

No Copyright

We advocate for the cancellation of intellectual property, or copyright laws, for scientific and educational resources.

Copyright laws render the operation of most online libraries illegal. Hence many people are deprived of knowledge, while rightholders are allowed to benefit hugely from this. Copyright fosters an increase in both informational and economic inequality.

Open Access

The Sci-Hub project supports the Open Access movement in science. Research should be published in open access, i.e. be free to read.

Open Access is a new and advanced form of scientific communication, which is going to replace outdated subscription models. We stand against the unfair gain that publishers collect by creating limits to knowledge distribution.

Sci Hub Frequently Asked Questions

What is Sci-Hub?

Sci-Hub is the democratisation of knowledge – publicly financed research should be freely available to the public. Why are big businesses profiting from scientific research? Use this site to download papers and freely disseminate information. Open Access is supported by Sci-Hub.

How many papers are there in Sci-Hub?

Currently there are 933189871 papers and PDFs in the Sci Hub library.

What is Library Genesis?

Library Genesis , also known as http://gen.lib.rus.ec , is a pirate library.

What is SciHub?

Sci-Hub is a website where you may get free access to reference papers and journal articles. Simply paste the DOI of the journal article you wish to read into Sci-Hub, and the PDF will be downloaded for free. Journal articles and academic papers are available for free on the site. These Sci-Hub mirrors are for you if you want to obtain knowledge. Sci-Hub is the Internet's largest archive of stolen academic articles. Sci-Hub and Library Genesis have made 48 million scholarly publications freely available to anybody with an Internet connection since Sci-Hub's inception in October 2011.

Is Sci Hub blocked in India?

Unfortunately, yes. India has blocked Sci-Hub. But there are two ways to unblock Sci-Hub in India: 1. Use Google DNS as your DNS server. See the settings here: How to set Google Public DNS 2. Tor. Download it here. Open up Tor Browser and go to any of the scihub links above or this special Tor-only Sci-Hub link: https://scihub22266oqcxt.onion This should let you access scihub from India.

Is Science Hub down?

Right now, Sci-Hub seems to be up and running. If you cannot access Sci-Hub, try some of the alternative links above, like whereisscihub. It’s possible your internet provider is blocking Sci-Hub; in that case an alternative link might still work.

Is there a Sci Hub Extension?

Yes, there is a Chrome Extension

Is Sci-Hub illegal?

Experts tend to agree that strictly speaking Sci-Hub is illegal, yes. But the items in Sci-Hub are not contraband, they are academic articles. There is no known case of anyone getting a fine or penalty for using Sci-Hub.

How can I get free articles?

1. Go to Sci-Hub 2. Type in the article’s DOI in the search field 3. The article will be downloaded for free

Who runs Sci-Hub?

Sci-Hub was founded by Alexandra Elbakyan in 2011 in response to the high cost of research papers behind paywalls.

Research guidance, Research Journals, Top Universities

How to use SCI HUB to download research papers for free


Use Sci-Hub to download research papers

The Sci-Hub project supports the Open Access movement in science. It provides mass, public access to research papers.

Most research papers published by reputed journals sit behind a paywall, so anyone who wants to download such a manuscript has to pay for access.

Sci-Hub allows downloading and reading such papers for free. It holds most academic and scientific papers. To use it, find the link or DOI of the journal article, visit the site, paste the DOI or URL into the search field, and click search. If the paper is available, a preview is shown, and you can download the paper for your reference.

Researchers most often use Sci-Hub to download research papers for free.

How to use Sci Hub?

Follow the steps below to download paywalled research papers for free using Sci-Hub.

Step 1: Go to the official Sci-Hub website.

Step 2: Enter the title, DOI, or URL of the research paper you want to download or read.

Step 3: Click Open or press the Enter key.

Step 4: The requested research paper then appears on the website. You can download it by clicking the download icon.


scidownl 1.0.2

pip install scidownl

Released: Apr 2, 2023

Download pdfs from Scihub.


License: MIT License

Author: Tishacy

Classifiers

  • OSI Approved :: MIT License
  • OS Independent
  • Python :: 3
  • Python :: 3.5
  • Python :: 3.6
  • Python :: 3.7

Project description

An unofficial api for downloading papers from SciHub.

  • Support downloading with DOI, PMID or TITLE.
  • Easy to update newest SciHub domains.
  • Ready for changes: Encapsulate possible future changes of SciHub as configurations.
  • Support proxies.

Quick Usage

Installation

Install with pip

SciDownl can be easily installed with pip.

Install from source code

Command line tool

1. Update available SciHub domains

There are 2 update modes that you could specify with an option: -m or --mode

crawl : [Default] Crawling the real-time updated SciHub domains website (aka, SciHub domain source) to get available SciHub domains. The SciHub domain source website url is configured in the global config file in the section [scihub.domain.updater.crawl] with the key of scihub_domain_source . You could use scidownl config --location to show the location of the global config file and edit it.

An example of using crawl mode:

search :Generate combinations according to the rules of SciHub domains and search for available SciHub domains. This will take longer than crawl mode.

An example of using search mode:

2. List all saved SciHub domains

SciDownl uses SQLite as the local database to store all updated SciHub domains. You can list all saved SciHub domains with the command domain.list .

In addition to the easy-to-understand Url column, the SuccessTimes column is used to record the number of successful paper downloads using this Url, and the FailedTimes column is used to record the number of failed paper downloads using this Url. These two columns are used to calculate the priority of choosing a SciHub domain when downloading papers.

3. Download papers

Download papers with doi(s), pmid(s) or title(s).

Use option -d or --doi to download papers by DOI, option -p or --pmid to download papers by PMID, and option -t or --title to download papers by title. You can specify these options multiple times, and even mix them.

Customize the output location of papers

By default, the downloaded paper is named after the paper's title. With option -o or --out, you can customize the output location of downloaded papers, which can be an absolute or a relative path, and either a directory or a file path.

Output the paper to a directory:

Output the paper to a file path:

NOTE that if more than one paper is to be downloaded, the value of the --out option will always be treated as a directory rather than a file path.

If some directories in the path do not exist, SciDownl will create them for you.

Use a specific SciHub url

With option -u or --scihub-url , you could use a specific SciHub url you want, rather than let SciDownl automatically choose one for you from local saved SciHub domains. It's recommended to let SciDownl choose a SciHub url, so you don't need to use this option in normal use.

You can use the scihub_download function to download papers.

More examples can be found in examples .

Copyright (c) 2022 tishacy.

Licensed under the MIT License .



How do I find articles?


What are DOIs and PMIDs?

DOI stands for Digital Object Identifier. This is a unique identifier that is assigned to an online journal article, online book or online book chapter. Most publishers assign these to their online content. A DOI can take you directly to an online resource, but the Library does not always have access at a publisher site. The DOI lookup links to any online access we have.

PMID is a unique identifier used in the PubMed database and can be used to look up abstracts in PubMed. The PMID lookup links to online access through the Library.

Find an article using DOI or PMID

This widget uses Libkey.io, which connects to our FindIt service to get you to an article. All you need is a DOI or PMID.

Enter a DOI or PMID

  • Updated: Jun 24, 2024 12:04 PM
  • URL: https://guides.lib.uchicago.edu/articles


Download Research Papers and Scientific Articles for free (Sci-Hub and Library Genesis links updated August 2022)


Many students and researchers need to find a paper for their research, to complete the review of an article, or while writing their thesis. Many papers can be found through your university library, but for those your institution does not give you access to, we take a look at the largest open access tools, as well as Sci-Hub and Library Genesis.

Unpaywall Unpaywall is a website built by Impactstory, a nonprofit working to make science more open and reusable online. They are supported by grants from the National Science Foundation and the Alfred P. Sloan Foundation. What they do is gather all the articles they can from all the open-access repositories on the internet. These are papers that have been provided by the authors or publishers for free, and thus Unpaywall is completely legal. They say they have about 50-85% of all scientific articles available in their archive. Works with Chrome or Firefox.

PaperPanda PaperPanda is a free browser extension for Chrome that gives you one-click access to papers and journal articles. When you find a paper on the publisher’s site, just click the PaperPanda icon and the panda goes and finds the PDF for you.

Open Access Button The Open Access Button  does something very similar to Unpaywall, with some major differences. They search thousands of public repositories, and if the article is not in any of them they send a request to the author to make the paper publicly available with them. The more people try to find an article through them, the more requests an author gets. You can search for articles/papers directly from their page, or download their browser extension.

Library Genesis Library Genesis is a database of over 5 million (yes, million) free papers, articles, entire journals, and non-fiction books. They also have comics, fiction books, and books in many non-english languages. They are also known as LibGen or Genesis Library. Many of the papers on Library Genesis are the same as sci hub, but what sets them apart is that Library Genesis has books as well.

OAmg OAmg lets you search for journal articles and papers, download them, and of course cite them in your Citationsy projects. After entering a query it searches through all published papers in the world and shows you the matches. You can then click a result to see more details and read a summary. It will also let you download the paper through a couple different, completely legal open access services. www.oa.mg

Sci-Hub (link updated August 2022) Finally, there’s Sci-Hub. Sci-Hub works in a completely different way from the open access tools above: researchers, students, and other academics donate their institutional logins to Sci-Hub, and when you search for a paper it is downloaded through one of those accounts. After the article has been downloaded, a copy is stored on Sci-Hub's own servers. You can basically download 99% of all scientific articles and papers on Sci-Hub by entering the DOI. Sci-Hub was launched by the researcher Alexandra Elbakyan in 2011 with the goal of providing free access to research to everyone, not only those who have the money to pay for journals. Many in the scientific community praise Sci-Hub for furthering the knowledge of humankind and helping academics from all over the world. Sci-Hub has been sued many times by publishers like Elsevier but is still accessible, for example through a Sci-Hub proxy.

You can find links to Sci-Hub on Wikipedia ( https://en.wikipedia.org/wiki/Sci-Hub ) or WikiData ( https://www.wikidata.org/wiki/Q21980377#P856 ).


Purdue University

  • Ask a Librarian

DOI / PMID Search

What are DOIs and PMIDs?

DOI and PMID refer to unique identifiers, which can be used to locate articles online. The boxes on this guide link these services to the Library's FindIt! service, allowing you to access resources through Library subscriptions.

DOI stands for Digital Object Identifier. This is a unique identifier that is assigned to an online journal article, online book or online book chapter. Most publishers assign these to their online content. A DOI can take you directly to an online resource, but the Library does not always have access at a publisher site. The DOI lookup links to any online access we have.

PMID is a unique identifier used in the PubMed database and can be used to look up abstracts in PubMed. The PMID lookup links to online access through the Library.

Search using DOI or PMID


Lookup a journal article by DOI or PMID

Instructions.

If you have a DOI or PMID for an article that you would like to obtain using Purdue Libraries subscriptions or via Inter-Library loan services, simply copy and paste the DOI or PMID in the box above and click search.

Examples to try (copy and paste these into the box above):

DOI Examples:

  • 10.1186/s12898-019-0263-7
  • 10.1016/j.seps.2021.101063     
  • 10.1016/j.fas.2020.05.008         

PMID Examples:

  • 25435309 
  • 30302018  
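A lookup box like the one above has to decide whether the pasted text is a DOI or a PMID. The following is a minimal Python sketch of that decision, using the sample identifiers listed above. The function names, the loose DOI pattern, and the 1 to 8 digit PMID check are assumptions made for illustration; they do not describe how the Library's own lookup is implemented.

```python
import re

# Assumed shapes, for illustration only.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")   # e.g. 10.1016/j.fas.2020.05.008
PMID_RE = re.compile(r"^\d{1,8}$")          # e.g. 25435309


def classify_identifier(text):
    """Return 'doi', 'pmid', or 'unknown' for a pasted identifier."""
    value = text.strip()  # copied examples often carry stray spaces
    if DOI_RE.match(value):
        return "doi"
    if PMID_RE.match(value):
        return "pmid"
    return "unknown"


def public_lookup_url(text):
    """Build a public landing-page URL: doi.org for DOIs, PubMed for PMIDs."""
    value = text.strip()
    kind = classify_identifier(value)
    if kind == "doi":
        return "https://doi.org/" + value
    if kind == "pmid":
        return "https://pubmed.ncbi.nlm.nih.gov/" + value + "/"
    raise ValueError(f"neither a DOI nor a PMID: {text!r}")


for sample in ["10.1016/j.seps.2021.101063     ", "25435309 "]:
    print(classify_identifier(sample), public_lookup_url(sample))
```

These URLs lead to the publisher page or the PubMed abstract; whether the full text opens still depends on your Library's subscriptions.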

 You can also install the Nomad Browser Extension which will automate the searching for you as you browse the Web.

Library Search

  • Last Edited: Jun 8, 2023 3:28 PM
  • URL: https://guides.lib.purdue.edu/doipmidsearch


SciHub Downloader by B Akhil kumar

This add-on is not actively monitored for security by Mozilla. Make sure you trust it before installing.

Download Papers from SciHub using DOI.


How to find a DOI?


Location of DOIs


A digital object identifier, or DOI, is a persistent handle that identifies a unique object in the digital world. The DOI system is standardized by the International Organization for Standardization (as ISO 26324), while the identifiers themselves are assigned by DOI registration agencies such as Crossref and DataCite to different types of scholarly material, such as papers, journal articles, books, data sets, reports, government publications, and even videos.

A DOI should always be easily available in any source. Usually, you will find it on the first page, either in the header or somewhere close to the title.

DOI in an article from Science

Alternatively, you can also find it in the "About this article" or "Cite this article" sections.

DOI in an article from Nature

If the DOI isn’t available, you can look it up on CrossRef.org by using the “Search Metadata” option. You just have to type in the source's title or author, and it will direct you to its DOI.
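The same lookup can be scripted against Crossref's public REST API. The sketch below assumes the `works` endpoint with its `query.bibliographic` and `rows` parameters and the `message` / `items` layout of its JSON response; the function names are hypothetical, and the network call is kept in a separate function so the URL building and parsing can be checked offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation_text, rows=3):
    """Build a Crossref search URL for free-text citation details."""
    params = {"query.bibliographic": citation_text, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def extract_candidates(payload):
    """Pull (DOI, title) pairs out of a decoded Crossref response."""
    items = payload.get("message", {}).get("items", [])
    candidates = []
    for item in items:
        titles = item.get("title") or [""]  # Crossref returns titles as a list
        candidates.append((item.get("DOI", ""), titles[0]))
    return candidates


def find_doi_candidates(citation_text, rows=3):
    """Query Crossref over the network and return candidate matches."""
    with urlopen(build_query_url(citation_text, rows), timeout=30) as response:
        return extract_candidates(json.load(response))


print(build_query_url("Ancient Biological Invasions and Island Ecosystems"))
```

Treat the results as candidates rather than answers: compare the returned title and authors with your source before copying the DOI into a citation.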

The correct format for adding a DOI to your citations will depend on the citation style you use. Here is a list of citation examples with DOIs in major formatting styles:

Hofman, C. A., & Rick, T. C. (2018). Ancient Biological Invasions and Island Ecosystems: Tracking Translocations of Wild Plants and Animals. Journal of Archaeological Research, 26(1), 65–115. https://doi.org/10.1007/s10814-017-9105-3

Hofman, Courtney A., and Torben C. Rick. “Ancient Biological Invasions and Island Ecosystems: Tracking Translocations of Wild Plants and Animals.”  Journal of Archaeological Research , vol. 26, no. 1, 2018, pp. 65–115, doi:10.1007/s10814-017-9105-3.

Hofman, Courtney A., and Torben C. Rick. 2018. “Ancient Biological Invasions and Island Ecosystems: Tracking Translocations of Wild Plants and Animals.”  Journal of Archaeological Research  26 (1): 65–115. https://doi.org/10.1007/s10814-017-9105-3.

The preferred format of a DOI in a citation is using https://doi.org/ followed by the alphanumeric string. It also depends on the style; as you can see that MLA prefers using doi:xxx. Make sure to double-check the citation style you use before adding the DOI.
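As a small illustration of that difference, the Python sketch below renders one DOI in the two forms discussed here. The style names and their mapping are assumptions drawn from the examples in this article (the URL form for APA and Chicago, the doi: form for MLA), so check your style guide's current edition before relying on it.

```python
# Assumed mapping for illustration: which styles use the resolver URL form
# and which use the bare "doi:" label.
URL_STYLES = {"apa", "chicago"}
LABEL_STYLES = {"mla"}


def format_doi(doi, style):
    """Render a bare DOI the way the given citation style writes it."""
    key = style.strip().lower()
    if key in URL_STYLES:
        return "https://doi.org/" + doi
    if key in LABEL_STYLES:
        return "doi:" + doi
    raise ValueError(f"no DOI rule sketched for style: {style!r}")


doi = "10.1007/s10814-017-9105-3"
for style in ("APA", "MLA", "Chicago"):
    print(style, "->", format_doi(doi, style))
```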

Tip: Instead of manually adding citations with DOIs to your documents, which is error-prone and strenuous, consider using a reference manager like Paperpile to format and organize your citations. Paperpile allows you to save and organize your citations for later use and cite them in thousands of citation styles directly in Google Docs, Microsoft Word, or LaTeX, including the DOI:

The preferred format of a DOI in a citation is using https://doi.org/ followed by the alphanumeric string. Of course, it depends on the style, as MLA prefers using doi:xxx. Make sure to double check the citation style you use before adding the DOI.

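DOIs are assigned by DOI registration agencies, such as Crossref and DataCite, to different types of scholarly material, such as papers, journal articles, books, data sets, reports, government publications, and even videos. The International Organization for Standardization (ISO) standardizes the DOI system itself (ISO 26324) but does not assign individual DOIs.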

URLs and DOIs are not the same. A DOI is a unique alphanumeric identifier that labels digital material and pinpoints its location on the internet, whereas a URL is a digital locator.

DOIs were invented for a reason. These alphanumeric identifiers allow readers to locate specific material in the digital world. They also add credibility to your sources.


Downloads pdfs via a DOI number, article title or a bibtex file, using the database of libgen(sci-hub) , arxiv


Scihub to pdf (beta)

Description

scihub2pdf is a module of bibcure

Downloads pdfs via a DOI number, article title or a bibtex file, using the database of libgen, Sci-Hub and Arxiv.

If you want to download files from scihub you will need to get PhantomJS

Linux Using npm

Features and how to use.

Given a bibtex file

Given a DOI number...

Given a title...

Location folder as argument

Use libgen instead sci-hub

  • Annoying CAPTCHA

Download from list of items

Given a text file like

download all pdf's


Adhikari V  V Subba Rao

  • Tata Institute of Social Sciences(TISS)-Mumbai

How to download a full research paper using DOI number?

All Answers (240)


  • Neurology 31(5):625-8
  • DOI: 10.1212/WNL.31.5.625
  • Source PubMed


  • https://www.abiebdragx.me/2019/03/Scopus-saluran-index-jurnal-terpercaya.html
  • https://www.abiebdragx.me/2017/12/gratis-cara-download-jurnal-berbayar-dengan-mudah-dan-gratis.html

  • Copy your DOI, then open https://sci-hub.tw/
  • Once the home page has loaded, paste your DOI directly into the search box.


Download Research Papers for Free: Legal and Ethical Methods

10 Legal Ways to Download Research Papers for Free: The Ultimate Guide

Dr. Somasundaram R

Are you a student, researcher, or curious individual looking to access scholarly articles without breaking the bank? You’re in luck! This comprehensive guide will walk you through various legal and ethical methods to download research papers for free. We’ll cover everything from open-access databases to contacting authors directly, ensuring you have all the tools to fuel your academic pursuits.

Why Access to Research Papers Matters

Before we dive into the methods, let’s quickly address why free access to research papers is so crucial:

  • Advancing knowledge: Open access to research promotes the spread of ideas and accelerates scientific progress.
  • Equalizing opportunities: Free access levels the playing field for researchers and students worldwide, regardless of their financial resources.
  • Encouraging collaboration: When research is freely available, it’s easier for scientists to build upon each other’s work and collaborate across institutions.

Now, let’s explore the various ways you can legally and ethically obtain research papers without spending a dime.

10 Legal Ways to Download Research Papers for Free: The Ultimate Guide

1. Leverage Open Access Databases

Open-access databases are treasure troves of freely available scholarly articles. Here are some of the best options:

  • PubMed Central (PMC): A free full-text archive of biomedical and life sciences journal literature.
  • Directory of Open Access Journals ( DOAJ ): A community-curated online directory that indexes high-quality, open-access, peer-reviewed journals.
  • arXiv: A repository of electronic preprints for physics, mathematics, computer science, and related fields.
  • CORE: The world’s largest collection of open-access research papers.

Pro tip: Many of these databases offer email alerts for new papers in your area of interest, helping you stay up-to-date with the latest research.

2. Utilize Academic Search Engines

Specialized academic search engines can help you find both open-access and potentially accessible papers:

  • Google Scholar: The most popular academic search engine, with features like cited by and related articles.
  • Microsoft Academic: A free public web search engine for academic publications and literature (retired by Microsoft at the end of 2021).
  • Semantic Scholar: An AI-powered research tool for scientific literature.

These search engines often provide direct links to free full-text versions when available or point you towards institutional repositories.

3. Explore Institutional Repositories

Many universities and research institutions maintain their own repositories of scholarly work produced by their faculty and students. These repositories often make papers freely available to the public. Try searching for “[University Name] repository” to find these goldmines of information.

4. Check Author’s Websites and Social Media

Many researchers maintain personal websites or profiles on academic social networks where they share their work. Try searching for the author’s name followed by their institution or area of expertise. Platforms to check include:

  • ResearchGate
  • Academia.edu

5. Contact the Authors Directly

If you can’t find a free version of a paper, don’t hesitate to reach out to the authors. Most researchers are happy to share their work and may send you a copy of their paper. Look for the corresponding author’s email address in the paper’s abstract or contact information.

6. Use Browser Extensions

Several browser extensions can help you find free versions of paywalled articles:

  • Unpaywall: A legal and simple tool that searches for free versions of scholarly articles.
  • Open Access Button: Searches for free, legal copies of research papers.
  • Kopernio (now EndNote Click): Helps you access PDF versions of scientific articles.
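Unpaywall also offers a public REST API, so the check its extension performs can be scripted. The Python sketch below assumes the v2 endpoint (`https://api.unpaywall.org/v2/<DOI>?email=<your address>`) and the `is_oa` and `best_oa_location` fields of its JSON response; the function names and the example email address are placeholders, and the network call is kept separate from the parsing so the latter can be checked offline.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL; Unpaywall asks callers to identify themselves by email."""
    return UNPAYWALL_BASE + quote(doi, safe="/") + "?" + urlencode({"email": email})


def best_open_copy(record):
    """Return a link to a legal open-access copy, or None if none is known."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link, fall back to the landing page.
    return location.get("url_for_pdf") or location.get("url")


def find_open_copy(doi, email):
    """Query Unpaywall over the network for one DOI."""
    with urlopen(unpaywall_url(doi, email), timeout=30) as response:
        return best_open_copy(json.load(response))


print(unpaywall_url("10.1186/s12898-019-0263-7", "you@example.org"))
```

If the function returns None, the remaining legal routes in this guide still apply, such as interlibrary loan or emailing the corresponding author.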

7. Take Advantage of Preprint Servers

Preprint servers host early versions of research papers before they undergo peer review. While these papers should be approached with caution, they can be valuable sources of cutting-edge research:

  • bioRxiv: For life sciences
  • chemRxiv: For chemistry and related fields
  • SocArXiv: For social sciences

8. Utilize Your Library’s Resources

Don’t forget about your local library! Many public and university libraries offer:

  • Access to academic databases
  • Interlibrary loan services
  • Remote access to digital resources

Even if you’re not currently a student, some libraries offer cards to community members that include database access.

9. Explore Sci-Hub Alternatives

While Sci-Hub is popular, it operates in a legal grey area. Instead, consider these alternatives:

  • Open Access Button: A legal tool that helps you request access to research papers.
  • Lazy Scholar: A browser extension that finds free full-text PDF versions of articles.
  • Unpaywall: Another legal alternative that finds open-access versions of articles.

10. Stay Informed About Open Access Initiatives

Keep an eye on developments in the open access movement. Initiatives like Plan S are working to make all publicly funded research freely available. Following these developments can help you stay ahead of the curve in accessing free research.


Ethical Considerations and Best Practices

While accessing free research papers, it’s crucial to keep these ethical considerations in mind:

  • Respect copyright laws and publisher agreements.
  • Use obtained papers for personal research and educational purposes only.
  • Properly cite all sources in your work.
  • Support open access initiatives when possible.

Accessing research papers for free is not only possible but also increasingly important in our interconnected world. By utilizing the methods outlined in this guide, you can tap into a vast wealth of knowledge without breaking the bank. Remember to always respect copyright laws and support the open access movement to ensure that knowledge remains freely accessible to all.

14 Websites to Download Research Paper for Free – 2024 – Alternative Methods

Collecting and reading relevant research articles to one’s research areas is important for PhD scholars. However, downloading a research paper is one of the most difficult tasks for any research scholar. You must pay for access to high-quality research materials or subscribe to the journal or publication. In this article, ilovephd lists the top 14 websites to download research papers, journals, books, datasets, patents, and conference proceedings for free.

Check the 14 best free websites to download and read research papers listed below:

1. Sci-Hub

Sci-Hub is a website with over 64.5 million academic papers and articles available for direct download. It bypasses publisher paywalls by allowing access through educational institution proxies. Sci-Hub stores the papers it retrieves in a repository called Library Genesis (LibGen). Researchers can download an article by simply entering its Digital Object Identifier (DOI).

Visit: Working Sci-Hub Proxy Links – 2024

2. Z-Library

Z-Library is a clone of Library Genesis, a shadow library project. Z-Library facilitates file sharing of scholarly journal articles, academic texts, and general-interest books (including some copyrighted materials). Most of its books come from Library Genesis, and users can also upload content directly to the site, which further expands the collection and makes literature more widely available. Individuals can also donate to the website's repository in support of its mission of free access.

Z-Library claims to have a massive collection, with more than 10,139,382 books and 84,837,646 articles as of April 25, 2024. According to the project's page for academic publications (at booksc.org), it aspires to be “the world’s largest e-book library” as well as “the world’s largest scientific papers repository.” Z-Library also describes itself as a donation-based non-profit organization.

Visit Z-Library – You can Download 70,000,000+ scientific articles for free

3. Library Genesis

The Library Genesis aggregator is a community aiming to collect and catalog item descriptions, mostly for scientific and technical fields, as well as file metadata. In addition to the descriptions, the aggregator contains only links to third-party resources hosted by users. All information posted on the website is collected from publicly available Internet resources and is intended solely for informational purposes.

Visit: libgen.li

4. Unpaywall – Free Research Paper Download

Unpaywall harvests Open Access content from over 50,000 publishers and repositories, and makes it easy to find, track, and use. It is integrated into thousands of library systems, search platforms, and other information products worldwide. If you’re involved in scholarly communication, there’s a good chance you’ve already used Unpaywall data.

Unpaywall is run by OurResearch, a nonprofit dedicated to making scholarship more accessible to everyone. Its source code is open as well.

Visit: unpaywall.org
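
For readers who want to look up a paper by DOI programmatically, here is a minimal Python sketch. It assumes Unpaywall's public REST endpoint (https://api.unpaywall.org/v2/ followed by the DOI, with a contact email as a query parameter) and the response fields is_oa and best_oa_location. The helper names and the sample record are hypothetical and only illustrate the idea; no network request is made here.

```python
import json
from urllib.parse import quote

# Assumed base URL of the Unpaywall REST API (check the official docs before relying on it).
UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; the API expects a contact email."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(record: dict):
    """Return a PDF link (or landing page) from a response record, else None."""
    if not record.get("is_oa"):
        return None
    best = record.get("best_oa_location") or {}
    return best.get("url_for_pdf") or best.get("url")


# Hypothetical response, trimmed to the fields used above.
sample = json.loads(
    '{"doi": "10.1136/bmjsem-2023-001774", "is_oa": true, '
    '"best_oa_location": {"url": "https://example.org/article", '
    '"url_for_pdf": "https://example.org/article.pdf"}}'
)

lookup_url = build_unpaywall_url("10.1136/bmjsem-2023-001774", "you@example.org")
pdf_url = pick_pdf_url(sample)
```

To use it for real, fetch lookup_url with any HTTP client, parse the JSON body, and pass the result to pick_pdf_url. A return value of None means no open-access copy was reported for that DOI.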

5. GetTheResearch.org

GetTheResearch.org is an Artificial Intelligence (AI) powered search engine that helps researchers and scientists search and understand scientific articles. It was developed as a part of the Unpaywall project. Unpaywall is a database of 23,329,737 free scholarly Open Access (OA) articles from over 50,000 publishers and repositories, and makes them easy to find, track, and use.

Visit: Find and Understand 25 Million Peer-Reviewed Research Papers for Free

6. Directory of Open Access Journals (DOAJ)

DOAJ (Directory of Open Access Journals) was launched in 2003 with 300 open-access journals. Today, this independent index contains almost 17,500 peer-reviewed, open-access journals covering all areas of science, technology, medicine, social sciences, arts, and humanities. Open-access journals from all countries and in all languages are accepted for indexing.

DOAJ is financially supported by many libraries, publishers, and other like-minded organizations. Supporting DOAJ demonstrates a firm commitment to open access and the infrastructure that supports it.

Visit: doaj.org

7. Researcher

Researcher is a free journal-finding mobile application that helps you read new journal papers relevant to your research every day. It is a popular mobile application, used by more than 3 million scientists and researchers to keep up with the latest academic literature.

Visit: 10 Best Apps for Graduate Students 

8. Science Open

ScienceOpen is a discovery platform with interactive features for scholars to enhance their research in the open, make an impact, and receive credit for it. It provides context-building services for publishers, to bring researchers closer to the content than ever before. These advanced search and discovery functions, combined with post-publication peer review, recommendation, social sharing, and collection-building features, make ScienceOpen the only research platform you’ll ever need.

Visit: scienceopen.com

9. OA.mg

OA.mg is a search engine for academic papers. Whether you are looking for a specific paper, or for research from a field, or all of an author’s works – OA.mg is the place to find it.

Visit: oa.mg

10. Internet Archive Scholar

Internet Archive Scholar (IAS) is a full-text search index that includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth-century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.

Visit: Sci hub Alternative – Internet Archive Scholar

11. Citationsy Archives

Citationsy was founded in 2017 after the reference manager Cenk was using at the time, RefMe, was shut down. It was immediately obvious that the reason people loved RefMe — a clean interface, speed, no ads, and simplicity of use — did not apply to CiteThisForMe. It turned out to be easier than anticipated to get a rough prototype up.

Visit: citationsy.com

12. CORE

CORE is the world’s largest aggregator of open-access research papers from repositories and journals. It is a not-for-profit service dedicated to the open-access mission. It serves the global network of repositories and journals by increasing the discoverability and reuse of open-access content.

It provides solutions for content management, discovery, and scalable machine access to research. Its services support a wide range of stakeholders, specifically researchers, the general public, academic institutions, developers, funders, and companies from a diverse range of sectors including but not limited to innovators, AI technology companies, digital library solutions, and pharma.

Visit: core.ac.uk

13. Dimensions

The database called “Dimensions” covers millions of research publications connected by more than 1.6 billion citations, supporting grants, datasets, clinical trials, patents, and policy documents.

Dimensions is the most comprehensive research grants database that links grants to millions of resulting publications, clinical trials, and patents. It provides up-to-the-minute online attention data via Altmetric, showing you how often publications and clinical trials are discussed around the world, with 226 million Altmetric mentions and 17 million links to publications.

Dimensions includes datasets from repositories such as Figshare, Dryad, Zenodo, Pangaea, and many more. It hosts millions of patents with links to other citing patents as well as to publications and supporting grants.

Visit: dimensions.ai

14. PaperPanda – Download Research Papers for Free

PaperPanda is a Chrome extension that uses some clever logic and the Panda’s detective skills to find you the research paper PDFs you need. Essentially, when you activate PaperPanda, it finds the DOI of the paper on the current page and then searches for it. It starts by querying various Open Access sources such as OpenAccessButton, OaDoi, SemanticScholar, Core, arXiv, and the Internet Archive. You can also set your university library’s domain in the settings (this feature is in the works and coming soon), and PaperPanda will then automatically search for the paper through your library. You can also set a different custom domain in the settings.

Visit: PaperPanda
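
The first step described above, finding a DOI in the text of a page, can be sketched in Python. This is not PaperPanda's actual code. The pattern is a commonly used approximation for modern DOIs and will miss some older or unusual ones, and the function name find_dois is hypothetical.

```python
import re

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix of letters, digits, and common punctuation.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text: str) -> list:
    """Return the distinct DOIs found in text, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.findall(text):
        # Sentence punctuation right after a DOI gets captured by the
        # pattern, so trim it (this would also trim a DOI that really
        # ends in one of these characters).
        doi = match.rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


page_text = (
    "Full text: https://doi.org/10.1136/bmjsem-2023-001774. "
    "Cite as doi:10.1136/bmjsem-2023-001774"
)
dois = find_dois(page_text)
```

Each DOI found this way can be turned into a resolvable link by prefixing it with https://doi.org/, or passed to an open-access lookup service.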

I hope this article will help you to know some of the best websites to download research papers and journals for free. By utilizing open-access databases, free search tools, and potentially even your local university library, you can access a wealth of valuable scholarly information without infringing on a copyright. Remember, ethical practices in research paper downloading are important, so always prioritize legal access to materials whenever possible. Happy researching!

Dr. Somasundaram R

hi im zara,student of art. could you please tell me how i can download the paper and books about painting, sewing,sustainable fashion,graphic and so on. thank a lot

thanks for the informative reports.

warm regards

Good, Keep it up!

iLovePhD is a research education website to know updated research-related information. It helps researchers to find top journals for publishing research articles and get an easy manual for research tools. The main aim of this website is to help Ph.D. scholars who are working in various domains to get more valuable ideas to carry out their research. Learn the current groundbreaking research activities around the world, love the process of getting a Ph.D.

Copyright © 2024 iLovePhD. All rights reserved

Sharing research data

As a researcher, you are increasingly encouraged, or even mandated, to make your research data available, accessible, discoverable and usable.

Sharing research data is something we are passionate about too, so we’ve created this short video and written guide to help you get started.

Research Data

What is research data?

While the definition often differs per field, generally, research data refers to the results of observations or experiments that validate your research findings. These span a range of useful materials associated with your research project, including:

Raw or processed data files

Research data  does not  include text in manuscript or final published article form, or data or other materials submitted and published as part of a journal article.

Why should I share my research data?

There are so many good reasons. We’ve listed just a few:

How you benefit

You get credit for the work you've done

Leads to more citations! 1

Can boost your number of publications

Increases your exposure and may lead to new collaborations

What it means for the research community

It's easy to reuse and reinterpret your data

Duplication of experiments can be avoided

New insights can be gained, sparking new lines of inquiry

Empowers replication

And society at large…

Greater transparency boosts public faith in research

Can play a role in guiding government policy

Improves access to research for those outside health and academia

Benefits the public purse as funding of repeat work is reduced

How do I share my research data?

The good news is it’s easy.

Yet to submit your research article?  There are a number of options available. These may vary depending on the journal you have chosen, so be sure to read the  Research Data  section in its  Guide for Authors  before you begin.

Already published your research article?  No problem – it’s never too late to share the research data associated with it.

Two of the most popular data sharing routes are:

Publishing a research elements article

These brief, peer-reviewed articles complement full research papers and are an easy way to receive proper credit and recognition for the work you have done. Research elements are research outputs that have come about as a result of following the research cycle – this includes things like data, methods and protocols, software, hardware and more.

You can publish research elements articles in several different Elsevier journals, including  our suite of dedicated Research Elements journals . They are easy to submit, are subject to a peer review process, receive a DOI and are fully citable. They also make your work more sharable, discoverable, comprehensible, reusable and reproducible.

The accompanying raw data can still be placed in a repository of your choice (see below).

Uploading your data to a repository like Mendeley Data

Mendeley Data is a certified, free-to-use repository that hosts open data from all disciplines, whatever its format (e.g. raw and processed data, tables, codes and software). With many Elsevier journals, it’s possible to upload and store your data to Mendeley Data during the manuscript submission process. You can also upload your data directly to the repository. In each case, your data will receive a DOI, making it independently citable and it can be linked to any associated article on ScienceDirect, making it easy for readers to find and reuse.

View an article featuring Mendeley data (just select the Research Data link in the left-hand bar or scroll down the page).

What if I can’t submit my research data?

Data statements offer transparency.

We understand that there are times when the data is simply not available to post or there are good reasons why it shouldn’t be shared.  A number of Elsevier journals encourage authors to submit a data statement alongside their manuscript. This statement allows you to clearly explain the data you’ve used in the article and the reasons why it might not be available.  The statement will appear with the article on ScienceDirect. 

View a sample data statement (just select the Research Data link in the left-hand bar or scroll down the page).

Showcasing your research data on ScienceDirect

We have 3 top tips to help you maximize the impact of your data in your article on ScienceDirect.

Link with data repositories

You can create bidirectional links between any data repositories you’ve used to store your data and your online article. If you’ve published a data article, you can link to that too.

Enrich with interactive data visualizations

The days of being confined to static visuals are over. Our in-article interactive viewers let readers delve into the data with helpful functions such as zoom, configurable display options and full screen mode.

Cite your research data

Get credit for your work by citing your research data in your article and adding a data reference to the reference list. This ensures you are recognized for the data you shared and/or used in your research. Read the  References  section in your chosen journal’s  Guide for Authors  for more information.

Ready to get started?

If you have yet to publish your research paper, the first step is to find the right journal for your submission and read the  Guide for Authors .

Find a journal by matching the title and abstract of your manuscript in Elsevier's JournalFinder

Find journal by title

Already published? Just view the options for sharing your research data above.

1 Several studies have now shown that making data available for an article increases article citations.

Evening regular activity breaks extend subsequent free-living sleep time in healthy adults: a randomised crossover trial

Volume 10, Issue 3

  • Jennifer T Gale 1 (ORCID: http://orcid.org/0000-0002-4594-8730)
  • Jillian J Haszard 2
  • Dorothy L Wei 1
  • Rachael W Taylor 3
  • Meredith C Peddie 1

  • 1 Department of Human Nutrition, University of Otago, Dunedin, New Zealand
  • 2 Biostatistics Centre, University of Otago, Dunedin, New Zealand
  • 3 Department of Medicine, University of Otago, Dunedin, New Zealand
  • Correspondence to Jennifer T Gale; jen.gale{at}postgrad.otago.ac.nz

Objective To determine if performing regular 3-min bouts of resistance exercise spread over 4 hours in an evening will impact subsequent sleep quantity and quality, sedentary time and physical activity compared with prolonged uninterrupted sitting.

Methods In this randomised crossover trial, participants each completed two 4-hour interventions commencing at approximately 17:00 hours: (1) prolonged sitting and (2) sitting interrupted with 3 min of bodyweight resistance exercise activity breaks every 30 min. On completion, participants returned to a free-living setting. This paper reports secondary outcomes relating to sleep quality and quantity, physical activity and sedentary time, which were assessed using wrist-worn ActiGraph GT3X+ accelerometers paired with a sleep and wear time diary.

Results A total of 28 participants (women, n=20), age 25.6±5.6 years, body mass index 29.5±6.7 kg/m² (mean±SD) provided data for this analysis. Compared with prolonged sitting, regular activity breaks increased mean sleep period time and time spent asleep by 29.3 min (95% CI: 1.3 to 57.2, p=0.040) and 27.7 min (95% CI: 2.3 to 52.4, p=0.033), respectively, on the night of the intervention. There was no significant effect on mean sleep efficiency (mean: 0.2%, 95% CI: −2.0 to 2.4, p=0.857), wake after sleep onset (1.0 min, 95% CI: −9.6 to 11.7, p=0.849) and number of awakenings (0.8, 95% CI: −1.8 to 3.3, p=0.550). Subsequent 24-hour and 48-hour physical activity patterns were not significantly different.

Conclusions Performing bodyweight resistance exercise activity breaks in the evening has the potential to improve sleep period and total sleep time and does not disrupt other aspects of sleep quality or subsequent 24-hour physical activity. Future research should explore the longer-term impact of evening activity breaks on sleep.

Trial registration number Australian New Zealand Clinical Trials Registry (ACTRN12621000250831).

  • Sitting time

Data availability statement

Data are available upon reasonable request. Data described in the manuscript will be made available upon reasonable request to the corresponding author.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

https://doi.org/10.1136/bmjsem-2023-001774

WHAT IS ALREADY KNOWN ON THIS TOPIC

Evidence indicates that evening exercise sessions have no disruptive effects and, in some cases, positive impacts on elements of sleep quality; however, current recommendations discourage exercise prior to bedtime. The regular activity breaks protocol has been shown to improve postprandial metabolism, but its impact on subsequent sleep is unknown.

WHAT THIS STUDY ADDS

Interrupting evening sedentary time with 3 min of light-intensity to moderate-intensity bodyweight resistance exercises every 30 min extends subsequent free-living time spent asleep by 27 min and has no disruptive effects on other elements of sleep and 24-hour physical activity patterns in healthy adults.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

Sleep hygiene recommendations should be reviewed to better reflect the current pool of evidence. Regularly interrupting prolonged sitting with short bouts of activity breaks is a promising intervention that may improve cardiometabolic health through multiple mechanisms (postprandial metabolism and sleep).

Introduction

Insufficient sleep can adversely affect diet 1 and has been associated with an increased risk of cardiometabolic diseases including incident coronary heart disease 2–4 and type 2 diabetes. 3 4 Other components of sleep are also important; difficulty with initiating and maintaining sleep also increases the risk of type 2 diabetes, 5 and disrupted sleep has been associated with a greater risk of coronary heart disease 6 and other cardiometabolic risk factors such as elevated blood pressure and blood lipid levels. 7

Although higher levels of daytime physical activity generally promote better sleep, current sleep hygiene recommendations discourage high-intensity exercise prior to bed because exercise-induced elevations in body temperature and heart rate can result in poorer sleep quality. 8 However, these recommendations do not appear widely supported by current evidence, with many experimental studies reporting no significant negative effects of late-night exercise on sleep quality 9 10 and some reporting favourable effects. 11–13 It may also be important to consider if changing activity patterns in the evening impacts overall physical activity and 24-hour activity patterns with existing data limited to children. 14 To date, no study appears to have investigated the impact of breaking up sedentary time in the evening by performing short bouts of light-intensity to moderate-intensity resistance exercises on subsequent sleep and physical activity patterns.

The evening period is a prime time to target behaviours that influence cardiometabolic health. Adults accrue the longest periods of uninterrupted sitting 15–17 and consume almost half their daily energy intake during this time. 18 Insulin sensitivity is also diminished in the evening 19 and together, these factors promote elevated postprandial responses, which can be detrimental to cardiometabolic health over time. 20 This activity breaks protocol, which interrupts evening prolonged sitting with 3 min of simple resistance exercises every 30 min, has been shown to positively affect postprandial metabolism. 21 However, it is unknown how this protocol, which increases the amount of activity participants are doing in the hours immediately preceding bedtime, influences subsequent sleep.

Therefore, the aims of this study were to determine the effect of performing regular resistance exercise breaks compared with prolonged sitting in the evening over 4 hours in a laboratory setting on the secondary outcomes sleep quantity and quality (sleep period time, efficiency and wake after sleep onset), sedentary time and physical activity over the subsequent free-living 48 hours.

Study design

This study was a randomised crossover trial. This manuscript focuses on secondary outcomes related to sleep quantity and quality and patterns of physical activity and sedentary time. The primary outcome has been published previously (see Gale et al 21). For further details, see the study protocol in online supplemental file 1.

Participants

This study was conducted in Dunedin, New Zealand. Thirty participants aged 18–40 years were recruited by word of mouth. A sample size of 30 participants was estimated to provide 80% power (5% significance) to detect a difference of 0.4 SD in glucose total area under the curve (which was the primary outcome of this study). Eligible participants were: non-smokers, not taking medications or supplements known to impact glucose or triglyceride metabolism, able to speak and understand English, without intolerances or allergies to gluten or dairy (these components were present in the test meals) and those who self-reported habitual sedentary time of more than 5 hours (work) and 2 hours (evening) per day. Participants were asked to obtain medical clearance if their responses to the Physical Activity Readiness Questionnaire indicated that physical activity may not be appropriate (n=1). Participants from across the body mass index categories (minimum 18.5 kg/m², no upper limit) were recruited to ensure representation from all groups given the relationship between obesity and glycaemic control. All participants provided written informed consent.

Preliminary measures

Participants attended an introductory session at the University of Otago to confirm eligibility for enrolment. Blood pressure was measured using an automated sphygmomanometer (OMRON HEM-907; Omron Healthcare; Kyoto, Japan) and a correctly sized cuff. Participants were excluded if their systolic or diastolic blood pressure readings were greater than 130 mm Hg and 90 mm Hg, respectively. Standard height and weight were measured in duplicate following standard procedures. Experimental protocols were discussed, and participants watched a video that demonstrated the exercises. Participants practiced the required exercises under supervision from the study research assistant (Registered Dietitian) who was instructed on how to observe and correct technique by a member of the research team who has a degree in Exercise Science (MCP). On completion of primary measurements, participants were fitted with an ActiGraph GT3X+ (ActiGraph, Pensacola, Florida, USA) accelerometer to be worn continuously (24 hours per day) on their non-dominant wrist for seven consecutive days to capture habitual physical activity and sleep patterns. Participants were provided with a wear time diary to record non-wear time, what times they retired to bed, attempted to sleep and woke up. Participants were also asked to record any physical activity performed while not wearing the accelerometer (eg, swimming or contact sport) or to record activities known to be inaccurately identified by the accelerometer (eg, stationary cycling, certain resistance-based exercises or yoga).

Randomisation

Participants were randomised to complete the two experimental conditions in one of two possible orders ( figure 1 ), stratified by weight status. The randomisation sequence was generated by MCP prior to recruitment using Stata (V.16; StataCorp, College Station, Texas, USA) and concealed electronically. The randomisation sequence was revealed and assigned on the afternoon prior to each participant beginning their first experimental condition. Participants were informed of their allocated sequence on arrival.

Figure 1. CONSORT study flow chart. BMI, body mass index; CONSORT, Consolidated Standards of Reporting Trials.

Pre-intervention standardisation protocols

To minimise diet-induced variability on experimental days, participants were provided with a standardised breakfast, morning tea, lunch and additional snacks to be consumed before 14:00 hours on each experimental day. A detailed summary of the standardised meal protocol is reported elsewhere. 21 Participants were fitted with an ActiGraph GT3X+ accelerometer for continuous wear on their non-dominant wrist from the morning of the intervention day to 48 hours after the intervention. In the 24 hours prior to the first experimental condition, participants were asked to avoid all moderate-intensity to vigorous-intensity physical activity. Participants verbally self-reported compliance with all pre-intervention protocols before each experimental session.

Intervention protocol

Details of the laboratory intervention sessions have been described previously. 21 Each participant completed two 4-hour sessions, on the same day of the week, from 17:00–17:30 to 21:30–22:00 hours, separated by a minimum 6-day washout to eliminate carry-over effects (median 6 days, IQR 6–12 days). The intervention sessions were conducted on either Tuesday or Thursday evenings, to ensure the next day was a ‘typical’ weekday, rather than a weekend day. In the prolonged sitting condition, participants remained seated for the duration of the session. The regular activity breaks condition was identical, except participants interrupted sitting with 3 min of simple resistance exercises every 30 min. Each break involved three exercises (chair squats, calf raises and standing knee raises with straight leg hip extensions) for 20 s each over three rounds. Participants performed exercises in time with a video recording of a person performing the exercises in a time standardised manner, and included reminders about form and a timer. These simple, body weight resistance exercises were chosen as the mode of activity breaks for this study as they do not require equipment, can be performed on the spot and have been used previously. 22 During the first session, participants were permitted to get up and use the bathroom as required and bathroom breaks were replicated during the subsequent session. While seated participants were able to watch television, read or work on a portable device during both conditions. Two standardised meals were provided during each condition at baseline and 2 hours. Sessions were supervised by two members of the research team. All participants completed every activity break, and no adverse events were reported during the breaks. Following the sessions, participants returned to their normal free-living environment with no further standardisation.

Physical activity and sleep data processing

For both periods of physical activity assessment (pre-trial habitual physical activity and the assessment of activity immediately prior to, during and after each intervention), time-stamped activity data were downloaded using ActiLife software (ActiLife V.6.13.4), saved in 15 s epochs and imported into Stata. Self-reported sleep and wake times were entered manually into ActiLife to constrain the Cole-Kripke algorithm 23 that determined sleep period time (time between self-reported time attempted sleep and the wake time), wake after sleep onset (WASO; minutes spent awake between sleep onset determined by the algorithm and end of sleep), total sleep time (amount of time spent sleeping during sleep period time, that is, sleep period time minus WASO), number of awakenings and sleep efficiency (how consolidated the sleep was). The intensity and duration of activity performed during self-reported non-wear time (eg, contact sport) were identified and manually overwritten in Stata. Sedentary time was classified as <2860 counts/min, with total physical activity represented by counts at or above this cut point (ie, ≥2860), which therefore combines light, moderate and vigorous activity. 24 Valid wear time was classified as wear time ≥10 hours during waking hours.
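
The definitions above can be illustrated with a short Python sketch. It is not the authors' ActiLife or Stata processing and does not reproduce the Cole-Kripke algorithm. It assumes that sleep efficiency is total sleep time as a percentage of sleep period time (the text does not give the formula) and that per-minute counts are obtained by summing four consecutive 15 s epochs. All function names and numbers are hypothetical.

```python
SEDENTARY_CUTPOINT = 2860  # counts/min threshold quoted in the text
EPOCHS_PER_MINUTE = 4      # four 15 s epochs make one minute


def sleep_summary(sleep_period_min: float, waso_min: float) -> dict:
    """Derive total sleep time and an assumed sleep efficiency."""
    total_sleep = sleep_period_min - waso_min  # definition given in the text
    # Assumption: efficiency = total sleep time / sleep period time, in percent.
    efficiency = 100.0 * total_sleep / sleep_period_min
    return {"total_sleep_min": total_sleep, "sleep_efficiency_pct": efficiency}


def counts_per_minute(epoch_counts: list) -> list:
    """Sum consecutive 15 s epoch counts into whole minutes (assumed aggregation).

    Trailing epochs that do not fill a complete minute are dropped.
    """
    full = len(epoch_counts) - len(epoch_counts) % EPOCHS_PER_MINUTE
    return [
        sum(epoch_counts[i:i + EPOCHS_PER_MINUTE])
        for i in range(0, full, EPOCHS_PER_MINUTE)
    ]


def classify_minutes(minute_counts: list) -> list:
    """Label each minute sedentary (< cut point) or active (>= cut point)."""
    return [
        "sedentary" if count < SEDENTARY_CUTPOINT else "active"
        for count in minute_counts
    ]


# Hypothetical night: 480 min sleep period with 40 min awake after sleep onset.
night = sleep_summary(480, 40)

# Hypothetical epochs: one minute at 2800 counts, one at 3200, one spare epoch.
minutes = counts_per_minute([700, 700, 700, 700, 800, 800, 800, 800, 100])
labels = classify_minutes(minutes)
```

Because everything at or above the cut point is pooled, the "active" label here covers light, moderate and vigorous activity together, matching the description in the text.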

Physical activity and sleep data were separated into two distinct time periods: intervention and post-intervention ( online supplemental figure 1 ). The post-intervention period was defined as the 48-hour period following the end time of the experimental condition although each nocturnal period (defined based on self-reported attempted sleep and wake times) during the post-intervention period was analysed separately.

Statistical analysis

Thirty participants completed the study; however, two participants with missing data were excluded (n=1: accelerometer malfunction, n=1: removed accelerometer overnight). Twenty-eight participants were included in the analyses. To investigate differences between conditions, mixed-effects regression models were used with sleep and activity variables as outcomes, intervention condition as the independent variable and participant as a random effect. Mean differences, 95% CIs and p values were calculated. Residuals of models were plotted and visually assessed for homoskedasticity and normality. A p value of <0.05 was considered statistically significant. All analyses were carried out in Stata V.17.0 (StataCorp LLC, College Station, Texas, USA).

Time spent in physical activity and sedentary behaviour were reported in (1) absolute minutes and (2) proportions of the waking day. Both are reported because if sleep period time is different between conditions, then absolute minutes in activity and sedentary time would necessarily be different due to the 24-hour constraint of the day. In this situation, the difference in activity or sedentary time may not represent the effect of the intervention directly, but rather represent displacement of other time because of a change in sleep period time. Proportions, however, describe differences in time-use composition of the waking day, independent of sleep period time. Both are informative.
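To make the distinction between absolute minutes and proportions concrete, the following small Python sketch uses invented numbers (they are not study data) and hypothetical function names. It shows that when sleep period time grows, activity minutes can fall while the share of the waking day spent active stays the same.

```python
MINUTES_PER_DAY = 24 * 60


def waking_minutes(sleep_period_min: float) -> float:
    """Waking day = 24 hours minus sleep period time."""
    return MINUTES_PER_DAY - sleep_period_min


def proportion_of_waking_day(activity_min: float, sleep_period_min: float) -> float:
    """Share of the waking day spent in a given behaviour."""
    return activity_min / waking_minutes(sleep_period_min)


# Illustrative values only: 30 min more sleep in condition B.
condition_a = {"sleep_period_min": 450, "activity_min": 297}
condition_b = {"sleep_period_min": 480, "activity_min": 288}

abs_diff = condition_b["activity_min"] - condition_a["activity_min"]
prop_a = proportion_of_waking_day(**condition_a)
prop_b = proportion_of_waking_day(**condition_b)

print(f"Difference in activity minutes: {abs_diff}")
print(f"Proportion active, A: {prop_a:.3f}, B: {prop_b:.3f}")
```

In this constructed case the 9-minute drop in activity reflects only the shorter waking day, which is why both measures are worth reporting.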

The first 24 hours was analysed as the primary time period to assess the acute effects of regular activity breaks in the evening. The full post-intervention period (48 hours) was analysed as the secondary time period to determine if any acute effects were apparent over 2 days.

As an increase in these sleep and activity variables can be either health promoting (sleep period time, total sleep time, sleep efficiency and physical activity) or not health promoting (WASO, number of night awakenings and sedentary time), a forest plot was created so that direction and strength of effects could be visually assessed more easily. For this, all mean differences and 95% CIs were standardised to be in units of SD.
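The standardisation step can be sketched as dividing each mean difference and its confidence limits by a standard deviation for that outcome. The source does not state which SD was used, so the `sd` argument and the value of 70 min in the usage lines are purely hypothetical; the function name is also an assumption.

```python
from typing import Tuple


def standardise(mean_diff: float, ci_low: float, ci_high: float, sd: float) -> Tuple[float, float, float]:
    """Express a mean difference and its 95% CI in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (mean_diff / sd, ci_low / sd, ci_high / sd)


# Sleep period time difference from the results (29.3 min, 95% CI 1.3 to 57.2),
# divided by a hypothetical SD of 70 min chosen only for illustration.
effect, low, high = standardise(29.3, 1.3, 57.2, sd=70.0)
print(f"Standardised effect: {effect:.2f} SD ({low:.2f} to {high:.2f})")
```

Putting all outcomes on the same SD scale is what allows variables measured in minutes, counts and percentages to share one forest plot axis.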

Equity, diversity and inclusion statement

Our research and author team consists of women, junior, mid-career and senior researchers from different disciplines (Human Nutrition & Dietetics, Biostatistics, Sleep and Exercise Sciences); however, all members are based at one University. We acknowledge that our study population is mostly well-educated, white women. We did not purposefully recruit marginalised communities, nor did we investigate the effects of marginalisation on the observed responses.

This study was commenced in March 2021 and ended in October 2021 when the intended sample size was reached (n=30). Participants were mostly women, of New Zealand European ethnicity, and 19–39 years of age ( table 1 ). Based on habitual accelerometry prior to intervention, participants spent 7 hours 47 min (SD 1 hour 13 min) asleep, 10 hours 31 min (1 hour 27 min) sedentary and 4 hours 55 min (1 hour 20 min) engaged in total (light and moderate-to-vigorous) physical activity on average. Three-quarters of participants had an optimal sleep duration, 21% were short sleepers (<7 hours) and 4% were long sleepers (>9 hours).

Participant characteristics* (n=28)

In the first nocturnal period, regular activity breaks increased sleep period time (the quantity of time between sleep onset and end of sleep) by 29.3 min (95% CI: 1.3 to 57.2, p=0.040, table 2 ) compared with prolonged sitting. There were no significant differences in sleep efficiency, WASO and number of awakenings. Total sleep time (amount of time a person spends sleeping during sleep period) was 27.7 min longer (95% CI: 2.3 to 52.4, p=0.033) following the regular activity breaks intervention (7 hours 12 min, SD 48 min) compared with prolonged sitting (6 hours 45 min, SD 82 min) ( table 2 ). Time that sleep was attempted did not significantly differ between conditions (11:56 pm for prolonged sitting and 11:58 pm for regular activity breaks) whereas mean wake times the following morning were different (7:35 am for prolonged sitting, 8:06 am for regular activity breaks) ( online supplemental table 1 ).

The effect of regular activity breaks and prolonged sitting in the evening on sleep, physical activity and sedentary time in the following 24 hours (n=28)

There were no statistically significant differences in activity patterns in the 24 hours following each intervention. However, compared with prolonged sitting, the regular activity breaks intervention resulted in 18 min (95% CI: −50.3 to 13.8, p=0.265) less total physical activity and 1.6% (95% CI: −4.6 to 1.4, p=0.289) less waking time being active, in the 24-hour period following intervention.

Figure 2 shows health-promoting effects of regular activity breaks in the evening with increased sleep period time (effect size 0.42 SD, 95% CI: 0.01 to 0.82) and total sleep time (effect size 0.38 SD, 95% CI: 0.01 to 0.75), as well as a (small, non-significant) decrease in sedentary time. Decreases in sleep efficiency and total physical activity and increases in WASO and number of awakenings were all small (effect size <0.3) and non-significant.

Standardised effect sizes for sleep and physical activity variables for night one following the regular activity breaks intervention compared with prolonged sitting, grouped as either health promoting or not health promoting.

There were no significant differences in measures of sleep or activity over the entire 48 hours following each intervention ( table 3 ). The mean difference in sleep period time for regular activity breaks compared with prolonged sitting in the subsequent 48-hour period was 0 min (−20.5 to 20.5, p>0.999). Mean bedtime, sleep onset and wake times for each nocturnal period by intervention can be found in online supplemental eTable 1 .

The effect of regular activity breaks and prolonged sitting in the evening on sleep, physical activity and sedentary time in the following 48 hours (n=28)

This study appears to be the first to explore the effect of evening resistance exercise breaks on subsequent sleep quality and physical activity patterns in healthy adults. Our results indicate that performing regular activity breaks in the evening in a laboratory setting significantly improves free-living sleep period time and total sleep time. Furthermore, this pattern of activity does not appear to disrupt other measured components of free-living sleep quality, nor does it negatively impact subsequent free-living physical activity.

These results add to a growing body of evidence that indicates evening exercise does not disrupt sleep quality, despite current sleep recommendations to the contrary. A meta-analysis of 23 experimental studies reported that, compared with no-exercise, performing one bout of physical activity ending within 4 hours prior to bedtime had no effect on total sleep time, WASO, sleep onset latency and efficiency. 10 Most of these studies used high-intensity cardiovascular physical activity protocols such as cycling or running, usually as a singular bout. Much less research has employed resistance exercise protocols, 11 25 26 which may also be a more pragmatic and simple choice for evening activity breaks protocols as individuals can perform the breaks on the spot without interrupting evening activities, such as streaming, thus improving adherence. Our study extends these findings by showing that short bouts of resistance activity performed throughout the evening also do not disrupt sleep quality, and in fact may be beneficial to total sleep time.

While existing research indicates that evening exercise may not adversely impact sleep, the mechanisms by which evening exercise influences sleep quality remain unclear. Increases in core temperature and extended periods of heart rate elevation which can influence melatonin production and increase neurological activity are unlikely with regular activity breaks using resistance exercises 25 27 performed in short bouts, which may explain why there were no differences in sleep quality in the present study. However, the mechanisms behind sleep extension observed in the current study are harder to explain and require further mechanistic data to elucidate.

It is important to note that after completing the prolonged sitting intervention, more than half of our participants (57%) slept <7 hours that night. Therefore, regular bodyweight resistance exercise breaks in the evening have the potential to help individuals meet optimal sleep recommendations and, over the long term, reduce cardiometabolic disease risk. Furthermore, previous research indicates that 30 additional minutes of sleep time can have a positive impact on clinical well-being, suggesting our results are clinically relevant, especially if the benefit could be extended over the long term. 28

Over the subsequent two nocturnal periods, the mean difference in sleep period time between interventions was 0 min which could indicate some degree of compensation for the additional sleep accrued in the first nocturnal period. Interestingly, as studies often assess compensation for sleep loss rather than sleep extension, 29 explanations of this effect will require further research. Although not statistically significant, there was a reduction in total physical activity of 18 min in the 24 hours following the regular activity breaks intervention compared with prolonged sitting. However, as the proportion of waking time spent in total activity did not change, it seems likely that the additional sleep has, in this case, displaced some total activity.

Research and policy implications

These results provide further evidence that the prevailing guidance to avoid physical activity in the hours before sleep should be removed from sleep hygiene recommendations. To better assess compensatory effects, future studies should assess the impact of performing regular evening activity breaks on sleep quality and activity patterns over a longer period. Additionally, future research should investigate the mechanisms driving the sleep extension induced by regular evening activity breaks.

Strengths and limitations

Key strengths of the study include its crossover design, which controls for individual variability, and our examination of both the immediate effects of the exercise protocols on sleep and the longer-term examination on activity patterns. Rigorous standardisation protocols were employed for food and physical activity. These strengths elevate the likelihood that the increase in sleep observed can be attributed to the regular activity breaks. Word of mouth recruitment resulted in a sample that was mostly young adult women, which limits the generalisability of the findings. However, participants self-reported spending large parts of their day (at least 5 hours) and evening (at least 2 hours) sedentary. This probably reflects the activity patterns of a wider portion of the population as it is estimated that adults spend more than half of the day engaged in sedentary behaviour. 30 There is limited nationally representative data on sedentary behaviour among New Zealand adults; however, habitual sedentary time of participants in the current study (65% of waking time) is slightly more than larger samples of adults from the USA (58% of monitored wake time). 31 Although participants were not screened for sleep disorders/complaints, objectively measured baseline sleep duration was somewhat similar to national data (collected via self-report) which indicated that ~68% of adults met sleep guidelines (75% in the current study) while 27% were short sleepers (21% in the current study). 32 As sleep was the main outcome, accelerometers were worn on the wrist, rather than on the waist (which is more appropriate for measurement of physical activity). As differentiating between moderate-to-vigorous and light-intensity physical activity can be difficult using existing wrist-worn accelerometry cut points, 33 only total physical activity was reported. As with all laboratory studies, the highly controlled setting may not reflect behaviour in a free-living setting. Thus, further research is required to assess whether activity breaks performed in the evening in a free-living setting can replicate beneficial impacts on sleep as reported here.

Evidence indicates that regular evening activity breaks have a positive effect on acute postprandial glucose and insulin responses in healthy adults. 21 The current study shows that this same protocol also extends subsequent sleep. Future research should explore the acceptability of performing regular evening activity breaks in a free-living setting to inform further intervention development. Additionally, future health initiatives could include tools (eg, a mobile application) to break up evening sedentary time with activity, which hold promise in improving cardiometabolic health via multiple targets (postprandial metabolism and sleep).

Ethics statements

Patient consent for publication.

Not applicable.

Ethics approval

This study involves human participants. The study was approved by the University of Otago Ethics Committee (Health; H20/161, December 2020). Participants gave informed consent to participate in the study before taking part.

Acknowledgments

We thank all study participants and research staff involved in the project.

  • Covassin N ,
  • McCrady-Spitzer SK , et al
  • Hoevenaar-Blom MP ,
  • Spijkerman AMW ,
  • Kromhout D , et al
  • Kritharides L ,
  • Attia J , et al
  • Svensson AK ,
  • Svensson T ,
  • Kitlinski M , et al
  • Cappuccio FP ,
  • Strazzullo P , et al
  • Chandola T ,
  • Ferrie JE ,
  • Perski A , et al
  • Yiallourou SR ,
  • Carrington MJ
  • Hirshkowitz M ,
  • Albert SM , et al
  • Frimpong E ,
  • Mograss M ,
  • Zvionow T , et al
  • Eiholzer R ,
  • Spengler CM
  • Mazzochi JW ,
  • Smith CJ , et al
  • Davenne D ,
  • Lehorgne C , et al
  • Flausino NH ,
  • Da Silva Prado JM ,
  • de Queiroz SS , et al
  • Saunders TJ ,
  • Chaput J-P ,
  • Goldfield GS , et al
  • Skeaff CM ,
  • Perry TL , et al
  • Bellettiere J ,
  • Carlson JA ,
  • Rosenberg D , et al
  • McMillan KA ,
  • Kirk AF , et al
  • Schatzkin A ,
  • Ballard-Barbash R
  • Dalla Man C ,
  • Nandy DK , et al
  • Antoine J ‐M. ,
  • Benton D , et al
  • Haszard JJ , et al
  • Dempsey PC ,
  • Larsen RN ,
  • Sethi P , et al
  • Kripke DF ,
  • Gruen W , et al
  • Montoye AHK ,
  • Clevenger KA ,
  • Pfeiffer KA , et al
  • Miller DJ ,
  • Sargent C ,
  • Roach GD , et al
  • Vincent GE ,
  • Halson S , et al
  • Cepeda MS ,
  • Blacketer C , et al
  • Richard J-B ,
  • Collin O , et al
  • Matthews CE ,
  • Freedson PS , et al
  • Dunstan DW , et al
  • Ministry of Health
  • Taylor FC ,
  • Dempsey PC , et al

Supplementary materials

Supplementary data.

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

  • Data supplement 1
  • Data supplement 2

Presented at Prior Presentation: Parts of this study were presented in abstract form at the Sleep in Aotearoa 2023 Annual Scientific Meeting, Dunedin, New Zealand, 22–23 June 2023.

Contributors MCP, JJH and RT contributed to conceptualisation. JJH and MCP contributed to methodology. JTG and JJH contributed to formal analysis. MCP, JJH and JTG contributed to investigation. MCP contributed to resources. JJH contributed to data curation. JTG contributed to writing—original draft preparation. JTG, DLW, JJH, RT and MCP contributed to writing—review and editing. JJH and MCP contributed to supervision. JTG contributed to project administration. MCP contributed to funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding This research was funded by the Health Research Council of New Zealand. JTG was supported by the Department of Human Nutrition Doctoral Scholarship, University of Otago.

Disclaimer MCP is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

  • Open access
  • Published: 29 July 2024

Predicting hospital length of stay using machine learning on a large open health dataset

  • Raunak Jain 1 ,
  • Mrityunjai Singh 1 ,
  • A. Ravishankar Rao 2 &
  • Rahul Garg 1  

BMC Health Services Research, volume 24, Article number: 860 (2024)

Governments worldwide are facing growing pressure to increase transparency, as citizens demand greater insight into decision-making processes and public spending. An example is the release of open healthcare data to researchers, as healthcare is one of the top economic sectors. Significant information systems development and computational experimentation are required to extract meaning and value from these datasets. We use a large open health dataset provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) containing 2.3 million de-identified patient records. One of the fields in these records is a patient’s length of stay (LoS) in a hospital, which is crucial in estimating healthcare costs and planning hospital capacity for future needs. Hence it would be very beneficial for hospitals to be able to predict the LoS early. The area of machine learning offers a potential solution, which is the focus of the current paper.

We investigated multiple machine learning techniques including feature engineering, regression, and classification trees to predict the length of stay (LoS) of all the hospital procedures currently available in the dataset. Whereas many researchers focus on LoS prediction for a specific disease, a unique feature of our model is its ability to simultaneously handle 285 diagnosis codes from the Clinical Classification System (CCS). We focused on the interpretability and explainability of input features and the resulting models. We developed separate models for newborns and non-newborns.

The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS. The best R² scores achieved are noteworthy: 0.82 for newborns using linear regression and 0.43 for non-newborns using catboost regression. Focusing on cardiovascular disease refines the predictive capability, achieving an improved R² score of 0.62. The models not only demonstrate high performance but also provide understandable insights. For instance, birth-weight is employed for predicting LoS in newborns, while diagnostic-related group classification proves valuable for non-newborns.

Our study showcases the practical utility of machine learning models in predicting LoS during patient admittance. The emphasis on interpretability ensures that the models can be easily comprehended and replicated by other researchers. Healthcare stakeholders, including providers, administrators, and patients, stand to benefit significantly. The findings offer valuable insights for cost estimation and capacity planning, contributing to the overall enhancement of healthcare management and delivery.

Introduction

Democratic governments worldwide are placing an increasing importance on transparency, as this leads to better governance, market efficiency, improvement, and acceptance of government policies. This is highlighted by reports from the Organization for Economic Co-operation and Development (OECD), an international organization whose mission is to shape policies that foster prosperity, equality, opportunity and well-being for all [ 1 ]. Openness and transparency have been recognized as pillars for democracy, and also for fostering sustainable development goals [ 2 ], which is a major focus of the United Nations ( https://sustainabledevelopment.un.org/sdg16 ).

An important government function is to provide for the healthcare needs of its citizens. The U.S. spends about $3.6 trillion a year on healthcare, which represents 18% of its GDP [ 3 ]. Other developed nations spend around 10% of their GDP on healthcare. The percentage of GDP spent on healthcare is rising as populations age. Consequently, research on healthcare expenditure and patient outcomes is crucial to maintain viable national economies. It is advantageous for nations to combine investigations by the private sector, government sector, non-profit agencies, and universities to find the best solutions. A promising path is to make health data open, which allows investigators from all sectors to participate and contribute their expertise. Though there are obvious patient privacy concerns, open health data has been made available by organizations such as New York State Statewide Planning and Research Cooperative System (SPARCS) [ 4 ].

Once the data is made available, it needs to be suitably processed to extract meaning and insights that will help healthcare providers and patients. We favor the creation and use of an open-source analytics system so that the entire research community can benefit from the effort [ 5 , 6 , 7 ]. As a concrete demonstration of the utility of our system and approach, we revealed that there is a growing incidence of mental health issues amongst adolescents in specific counties in New York State [ 8 ]. This has resulted in targeted interventions to address these problems in these communities [ 8 ]. Knowing where the problems lie allows policymakers and funding agencies to direct resources where needed.

Healthcare in the U.S. is largely provided through private insurance companies and it is difficult for patients to reliably understand what their expected healthcare costs are [ 9 , 10 ]. It is ironic that consumers can readily find prices of electronics items, books, clothes etc. online, but cannot find information about healthcare as easily. The availability of healthcare information including costs, incidence of diseases, and the expected length of stay for different procedures will allow consumers and patients to make better and more informed choices. For instance, in the U.S., patients can budget pre-tax contributions to health savings accounts, or decide when to opt for an elective surgery based on the expected duration of that procedure.

To achieve this capability, it is essential to have the underlying data and models that interpret the data. Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients. Significant advances have been made recently in the fields of data mining, machine-learning and artificial intelligence, with growing applications in healthcare [ 11 ]. To make our work concrete, we use our machine-learning system to predict the length of stay (LoS) in hospitals given the patient information in the open healthcare data released by New York State SPARCS [ 4 ].

The LoS is an important variable in determining healthcare costs, as costs directly increase for longer stays. The analysis by Jones [ 12 ] shows that the trends in LoS, hospital bed capacity and population growth have to be carefully analyzed for capacity planning and to ensure that adequate healthcare can be provided in the future. With certain health conditions such as cardiovascular disease, the hospital LoS is expected to increase due to the aging of the population in many countries worldwide [ 13 ]. During the COVID-19 pandemic, hospital bed capacity became a critical issue [ 14 ], and many regions in the world experienced a shortage of healthcare resources. Hence it is desirable to have models that can predict the LoS for a variety of diseases from available patient data.

The LoS is usually unknown at the time a patient is admitted. Hence, the objective of our research is to investigate whether we can predict the patient LoS from variables collected at the time of admission. By building a predictive model through machine learning techniques, we demonstrate that it is possible to predict the LoS from data that includes the Clinical Classifications Software (CCS) diagnosis code, severity of illness, and the need for surgery. We investigate several analytics techniques including feature selection, feature encoding, feature engineering, model selection, and model training in order to thoroughly explore the choices that affect eventual model performance. By using a linear regression model, we obtain an R² value of 0.42 when we predict the LoS from a set of 23 patient features. The success of our model will be beneficial to healthcare providers and policymakers for capacity planning purposes and to understand how to control healthcare costs. Patients and consumers can also use our model to estimate the LoS for procedures they are undergoing or for planning elective surgeries.
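For readers less familiar with the metric, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The short Python sketch below uses a hypothetical function name and toy values (not SPARCS data), and is not taken from the authors' system.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Toy example: observed and predicted lengths of stay in days (illustrative values).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
print(f"R^2 = {r_squared(observed, predicted):.2f}")
```

A model that always predicts the mean LoS scores 0 on this metric, and a perfect model scores 1, which gives a frame of reference for the values reported in this paper.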

Stone et al. [ 15 ] present a survey of techniques used to predict the LoS, which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [ 16 ] surveyed methods for LoS prediction.

The main gap in the literature is that most methods focus on analyzing trends in the LoS or predicting the LoS only for specific conditions or restrict their analysis to data from specific hospitals. For instance, Sridhar et al. [ 17 ] created a model to predict the LoS for joint replacements in rural hospitals in the state of Montana by using a training set with 127 patients and a test set with 31 patients. In contrast, we have developed our model to predict the LoS for 285 different CCS diagnosis codes, over a set of 2.3 million patients over all hospitals in New York state. The CCS diagnosis code refers to the code used by the Clinical Classifications Software system, which encompasses 285 possible diagnosis and procedure categories [ 18 ]. Since the CCS diagnosis codes are too numerous to list, we give a few examples that we analyzed, including but not limited to abdominal hernia, acute myocardial infarction, acute renal failure, behavioral disorders, bladder cancer, Hodgkin's disease, multiple sclerosis, multiple myeloma, schizophrenia, septicemia, and varicose veins. To the best of our knowledge, there are no models that predict the LoS on such a variety of diagnosis codes, with a patient sample greater than 2 million records, and with freely available open data. Hence, our investigation is unique from this point of view.

Sotodeh et al. [ 19 ] developed a Markov model to predict the LoS in intensive care unit patients. Ma et al. [ 20 ] used decision tree methods to predict LoS in 11,206 patients with respiratory disease.

Burn et al. examined trends in the LoS for patients undergoing hip-replacement and knee-replacement in the U.K. [ 21 ]. Their study demonstrated a steady decline in the LoS from 1997–2012. The purpose of their study was to determine factors that contributed to this decline, and they identified improved surgical techniques such as fast-track arthroplasty. However, they did not develop any machine-learning models to predict the LoS.

Hachesu et al. examined the LoS for cardiac disease patients [ 22 ] and found that blood pressure is an important predictor of LoS. Garcia et al. determined factors influencing the LoS for patients undergoing treatment for hip fracture [ 23 ]. Vekaria et al. analyzed the variability of LoS for COVID-19 patients [ 24 ]. Arjannikov et al. [ 25 ] used positive-unlabeled learning to develop a predictive model for LoS.

Gupta et al. [ 26 ] conducted a meta-analysis of previously published papers on the role of nutrition on the LoS of cancer patients, and found that nutrition status is especially important in predicting LoS for gastrointestinal cancer. Similarly, Almashrafi et al. [ 27 ] performed a meta-analysis of existing literature on cardiac patients and reviewed factors affecting their LoS. However, they did not develop quantitative models in their work. Kalgotra et al. [ 28 ] use recurrent neural networks to build a prediction model for LoS.

Daghistani et al. [ 13 ] developed a machine learning model to predict length of stay for cardiac patients. They used a database of 16,414 patient records and classified the length of stay into three classes, consisting of short LoS (< 3 days), intermediate LoS (3–5 days) and long LoS (> 5 days). They used detailed patient information, including blood test results, blood pressure, and patient history including smoking habits. Such detailed information is not available in the much larger SPARCS dataset that we utilized in our study.
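The three-class scheme described above amounts to a simple binning rule. A minimal Python sketch follows; the function name and class labels are hypothetical, and only the thresholds (under 3 days, 3 to 5 days, over 5 days) come from the text.

```python
def los_class(days: float) -> str:
    """Bin a length of stay (in days) into short, intermediate or long."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    if days < 3:
        return "short"
    if days <= 5:
        return "intermediate"
    return "long"


# Example: label a handful of illustrative stays.
stays = [1, 3, 5, 6]
print([los_class(d) for d in stays])
```

Turning the LoS into classes makes the task a classification problem, whereas the present paper treats the LoS as a continuous target for regression.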

Awad et al. [ 29 ] provide a comprehensive review of various techniques to predict the LoS. Though simple statistical methods have been used in the past, they make assumptions that the LoS is normally distributed, whereas the LoS has an exponential distribution [ 29 ]. Consequently, it is preferable to use techniques that do not make assumptions about the distribution of the data. Candidate techniques include regression, classification and regression trees, random forests, and neural networks. Rather than using statistical parametric techniques that fit parameters to specific statistical distributions, we favor data-driven techniques that apply machine-learning.

In 2020, during the height of the COVID-19 pandemic, the Lancet, a premier medical journal, drew widespread rebuke [ 30 , 31 , 32 ] for publishing a paper based on questionable data. Many medical journals published expressions of concern [ 33 , 34 ]. The Lancet itself retracted the questionable paper [ 35 ], which is available at [ 36 ] with the stamp “retracted” placed on all pages. One possible solution to prevent such incidents from occurring is for top medical journals to require authors to make their data available for verification by the scientific community. Patient privacy concerns can be mitigated by de-identifying the records made available, as is already done by the New York State SPARCS effort [ 4 ]. Our methodology and analytics system design will become more relevant in the future, as there is a desire to prevent a repetition of the Lancet debacle. Even before the Lancet incident, there was declining trust amongst the public related to medicine and healthcare policy [ 37 ]. This situation continues today, with multiple factors at play, including biased news reporting in mainstream media [ 38 ]. A desirable solution is to make these fields more transparent, by releasing data to the public and explaining the various decisions in terms that the public can understand. The research in this paper demonstrates how such a solution can be developed.

Requirements

We describe the following three requirements of an ideal system for processing open healthcare data:

Utilize open-source platforms to permit easy replicability and reproducibility.

Create interpretable and explainable models.

Demonstrate an understanding of how the input features determine the outcomes of interest.

The first requirement captures the need for research to be easily reproduced by peers in the field. There is growing concern that scientific results are becoming hard for researchers to reproduce [ 39 , 40 , 41 ]. This undermines the validity of the research and ultimately hurts the fields. Baker termed this the “reproducibility crisis”, and performed an analysis of the top factors that lead to irreproducibility of research [ 39 ]. Two of the top factors consist of the unavailability of raw data and code.

The second requirement addresses the need for the machine-learning models to produce explanations of their results. Though deep-learning models are popular today, they have been criticized for functioning as black-boxes, and the precise working of the model is hard to discern. In the field of healthcare, it is more desirable to have models that can be explained easily [ 42 ]. Unless healthcare providers understand how a model works, they will be reluctant to apply it in their practice. For instance, Reyes et al. determined that interpretable Artificial Intelligence systems can be better verified, trusted, and adopted in radiology practice [ 43 ].

The third requirement reflects the importance of capturing relevant patient features that can be related to the outcomes of interest, such as LoS, total cost, and mortality rate. Furthermore, healthcare providers should be able to understand the influence of these features on the performance of the model [ 44 ]. This is especially critical when feature engineering methods are used to combine existing features and create new features.

In the subsequent sections, we present our design for a healthcare analytics system that satisfies these requirements. We apply this methodology to the specific problem of predicting the LoS.

We have designed the overall system architecture as shown in Fig. 1. This system is built to handle any open data source. We have shown the New York SPARCS as one of the data sources for the sake of specificity. Our framework can be applied to data from multiple sources, such as the Centers for Medicare & Medicaid Services (CMS) in the U.S., as shown in our previous work [ 6 ]. We chose a Python-based framework that utilizes Pandas [ 45 ] and Scikit-learn [ 46 ]. Python is currently the most popular programming language for engineering and system design applications [ 47 ].

figure 1

Shows the system architecture. We use Python-based open-source tools such as Pandas and Scikit-Learn to implement the system

In Fig.  2 , we provide a detailed overview of the necessary processing stages. The specific algorithms used in each stage are described in the following sections.

figure 2

Shows the processing stages in our analytics pipeline

Recent research has shown that it is highly desirable for machine learning models used in the healthcare domain to be explainable to healthcare providers and professionals [ 48 ]. Hence, we focused on the interpretability and explainability of input features in our dataset and the models we chose to explore. We restricted our investigation to models that are explainable, including regression models, multinomial logistic regression, random forests, and decision trees. We also developed separate models for newborns and non-newborns.

Brief description of the dataset

During our investigation, we utilized open-health data provided by the New York State SPARCS system. The data we accessed was from the year 2016, which was the most recent year available at the time. This data was provided in the form of a CSV file, containing 2,343,429 rows and 34 columns. Each row contains de-identified in-patient discharge information. The dataset columns contained various types of information. They included geographic descriptors related to the hospital where care was provided, demographic descriptors such as patient race, ethnicity, and age, medical descriptors such as the CCS diagnosis code, APR DRG code, severity of illness, and length of stay. Additionally, payment descriptors were present, which included information about the type of insurance, total charges, and total cost of the procedure.

Detailed descriptions of all the elements in the data can be found in [ 49 ]. The CCS diagnosis code has been described earlier. The term “DRG” stands for Diagnosis Related Group [ 49 ], which is used by the Centers for Medicare & Medicaid Services in the U.S. for reimbursement purposes [ 50 ].

The data includes all patients who underwent inpatient procedures at all New York State Hospitals [ 51 ]. The payment for the care can come from multiple sources: Department of Corrections, Federal/State/Local/Veterans Administration, Managed Care, Medicare, Medicaid, Miscellaneous, Private Health Insurance, and Self-Pay. The dataset sourced from the New York State SPARCS system, encompassing a wider patient population beyond Medicare/Medicaid, holds greater value compared to datasets exclusively composed of Medicare/Medicaid patients. For instance, Gilmore et al. analyzed only Medicare patients [ 52 ].

We examine the distribution of the LoS in the dataset, as shown in Fig.  3 . We note that the providers of the data have truncated the length of stay to 120 days. This explains the peak we see at the tail of the distribution.

figure 3

Distribution of the length of stay in the dataset

Data pre-processing and cleaning

We identified 36,280 samples (1.55% of the data) with missing values and discarded them from further analysis. We also removed samples with Type of Admission = ‘Unknown’ (0.02% of samples). The final dataset has 2,306,668 samples. The columns ‘Payment Typology 2’ and ‘Payment Typology 3’ have missing values in at least 50% of samples; these were replaced by the string ‘None’.
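The cleaning steps above can be sketched with Pandas as follows. This is a minimal illustration on invented toy rows, not the code used in the study: the helper name `clean` is hypothetical, and the order of operations (filling the sparse payment columns before dropping incomplete rows) is an assumption, since the text does not state it.

```python
import pandas as pd

# Toy stand-in for the SPARCS CSV; column names follow the text, values are invented.
raw = pd.DataFrame({
    "Length of Stay": [2, 5, None, 3, 1],
    "Type of Admission": ["Emergency", "Unknown", "Elective", "Newborn", "Urgent"],
    "Payment Typology 1": ["Medicare", "Medicaid", "Self-Pay", "Medicaid", "Medicare"],
    "Payment Typology 2": [None, "Medicaid", None, None, "Self-Pay"],
    "Payment Typology 3": [None, None, None, None, None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps described in the text."""
    out = df.copy()
    sparse_cols = ["Payment Typology 2", "Payment Typology 3"]
    # Assumption: fill the mostly-empty payment columns first,
    # so that they do not cause rows to be dropped.
    out[sparse_cols] = out[sparse_cols].fillna("None")
    # Discard rows that still contain a missing value.
    out = out.dropna()
    # Discard rows whose admission type is unknown.
    out = out[out["Type of Admission"] != "Unknown"]
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(len(raw), "->", len(cleaned))
```

On the real data, the same function would be applied to the frame returned by `pd.read_csv`, and the row counts before and after can be compared with the figures reported above.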

We note that approximately 10% of the dataset consists of rows representing newborns. We treat this group as a separate category. We found that the ‘Birth Weight’ feature had a zero value for non-newborn samples. Accordingly, to better use the ‘Birth Weight’ feature, we partitioned the data into two classes: newborns and non-newborns. This results in two classes of models, one for newborns and the second for all other patients. We removed the ‘Birth Weight’ feature in the input for the non-newborn samples as its value was zero for those samples.
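A minimal sketch of this partition is given here. It assumes that a nonzero ‘Birth Weight’ identifies a newborn record, which follows from the observation above but may differ from the exact rule used in the study; the toy values are invented.

```python
import pandas as pd

# Invented toy rows; birth weight is in grams and zero for non-newborns.
df = pd.DataFrame({
    "Birth Weight": [0, 3200, 0, 2800, 0],
    "Length of Stay": [4, 2, 7, 3, 1],
})

# Assumption: a nonzero birth weight marks a newborn record.
is_newborn = df["Birth Weight"] > 0

newborns = df[is_newborn].copy()
# The feature is constant (zero) for everyone else, so it is dropped there.
non_newborns = df[~is_newborn].drop(columns=["Birth Weight"])

print(len(newborns), len(non_newborns))
```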

The columns ‘Total Costs’ and ‘Total Charges’ are usually proportional to the LoS, and it would not be fair to use these variables to predict the LoS. Hence, we removed these columns. We found that the columns ‘Discharge Year’ and ‘Abortion Edit Indicator’ are redundant for LoS prediction models, and we removed them. We also removed the columns ‘CCS Diagnosis Description’, ‘CCS Procedure Description’, ‘APR DRG Description’, ‘APR MDC Description’, and ‘APR Severity of Illness Description’ as we were given their corresponding numerical codes as features.

Since the focus of this paper is on the prediction of the LoS, we analyzed the distribution of LoS values in the dataset.

We developed regression models using all the LoS values, from 1–120. We also developed classification models where we discretized the LoS into specific bins. Since the distribution of LoS values is not uniform, and is heavily clustered around smaller values, we discretized the LoS into a small number of bins, e.g. 6 to 8 bins.

We utilized 10% of the data as a holdout test-set, which was not seen during the training phase. For the remaining 90% of the data, we used tenfold cross-validation in order to train the model and determine the best parameters to use.
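The evaluation protocol can be sketched with Scikit-learn as follows. The synthetic data, the choice of `LinearRegression`, and the random seeds are placeholders chosen for illustration; only the 10% holdout and the tenfold cross-validation on the remaining 90% come from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the encoded features and the LoS target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# Hold out 10% of the samples; they are not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Tenfold cross-validation on the remaining 90%.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv, scoring="r2")

# Final fit on the full training portion, scored once on the holdout set.
model = LinearRegression().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)
print(cv_scores.mean(), holdout_r2)
```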

Feature encoding

Many variables in the dataset are categorical, e.g., the variable “APR Severity of Illness Description” takes values in the set [Major, Minor, Moderate, Extreme]. We used distribution-dependent target encoding techniques and one-hot techniques to improve the model performance [ 53 ]. We replaced each category value with the product of the mean LoS and the median LoS for that category value. The encoded feature can then better capture how the distribution of the LoS depends on the value of the categorical feature.
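As an illustration of this encoding, the following sketch replaces each category value with the product of the mean and median LoS observed for it. The function name `target_encode` and the toy rows are hypothetical. In practice the mapping would be estimated on the training folds only, so that the target does not leak into the test data.

```python
import pandas as pd

# Invented toy rows: a categorical code and the corresponding LoS in days.
train = pd.DataFrame({
    "APR Severity of Illness Code": [1, 1, 1, 2, 2, 3],
    "Length of Stay": [1, 2, 3, 4, 6, 10],
})

def target_encode(df, column, target="Length of Stay"):
    """Map each category to mean(LoS) * median(LoS) within that category."""
    stats = df.groupby(column)[target].agg(["mean", "median"])
    mapping = stats["mean"] * stats["median"]
    return df[column].map(mapping), mapping

encoded, mapping = target_encode(train, "APR Severity of Illness Code")
print(mapping.to_dict())
```

For code 1 the mean and the median are both 2 days, so the encoded value is 4; for code 2 both are 5 days, giving 25.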

For the linear regression model [ 54 ], we sampled a set of 6 categorical features, [‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, ‘APR MDC Code’] which we target encoded with the mean of the LoS and the median of the LoS. We then one-hot encoded every feature (all features are categorical) and for each such one-hot encoded feature, created a new feature for each of the features in the sampled set, by replacing the ones in the one-hot encoded feature with the value of the corresponding feature in the sampled set. For example, we one-hot encoded ‘Operating Certificate Number’, and for samples where ‘Operating Certificate Number’ was 3, we created 6 features, each where samples having the value 3 were assigned the target encoded values of the sampled set features, and the other samples were assigned zero. We used such techniques to exploit the linear relation between LoS and each feature.
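One way to read this construction is as a product between each one-hot indicator column and each target-encoded column of the sampled set. The sketch below follows that reading on a toy frame; the `te_` prefixed columns hold invented target-encoded values, the `OCN` prefix is a hypothetical shorthand for ‘Operating Certificate Number’, and only two sampled features are shown for brevity.

```python
import pandas as pd

# Invented toy rows. The te_ columns stand for already target-encoded features.
df = pd.DataFrame({
    "Operating Certificate Number": [3, 3, 7, 7],
    "te_Type of Admission": [4.0, 9.0, 4.0, 2.0],
    "te_Patient Disposition": [1.0, 6.0, 6.0, 1.0],
})
sampled = ["te_Type of Admission", "te_Patient Disposition"]

# One indicator column per certificate number.
one_hot = pd.get_dummies(df["Operating Certificate Number"], prefix="OCN", dtype=float)

# Replace the ones in each indicator column with the target-encoded value
# of each sampled feature; rows outside the category stay at zero.
interactions = {}
for oh_col in one_hot.columns:
    for te_col in sampled:
        interactions[f"{oh_col}*{te_col}"] = one_hot[oh_col] * df[te_col]

features = pd.DataFrame(interactions)
print(features)
```

Each category value thus receives its own slope for every sampled feature, which is what allows a linear model to fit category-specific linear relations.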

According to the sklearn documentation [ 55 ], a random forest regressor is “a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting”. The random forest regressor leverages ensemble learning based on many randomized decision trees to make accurate and robust predictions for regression problems. The averaging of many trees protects against single trees overfitting the training data.

The random forest classifier is also an ensemble learning technique and uses many randomized decision trees to make predictions for classification problems. The 'wisdom of crowds' concept suggests that the decision made by a larger group of people is typically better than an individual. The random forest classifier uses this intuition, and allows each decision tree to make a prediction. Finally, the most popular predicted class is chosen as the overall classification.

For the Random Forest Regressor [ 56 , 57 ] and Random Forest Classifier [ 58 ], we used only a similar distribution-dependent target encoding, as random forest classifiers and regressors are unsuitable for sparse one-hot encoded columns.

Multinomial logistic regression is a type of regression analysis that predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. It allows for more than two discrete outcomes, extending binomial logistic regression for binary classification to models with multiple class membership. For the multinomial logistic regression model [ 59 ], we used only one-hot encoding, and not target encoding, as the target value was categorical.

Finally, we experimented with combinations of target encoding and one-hot encoding. We can either use target encoding, or one-hot encoding, or both. When both encodings are employed, the dimensionality of the data increases to accommodate the one-hot encoded features. For each combination of encodings, we also experimented with different regression models including linear regression and random forest regression.

Feature importance, selection, and feature engineering

We experimented with different feature selection methods. Since the focus of our work is on developing interpretable and explainable models, we used SHAP analysis to determine relevant features.

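We examine the importance of different features in the dataset. We used the SHAP value (Shapley Additive Explanations), a popular measure for feature importance [ 60 ]. Intuitively, the SHAP value measures the difference in model predictions when a feature is used versus omitted. It is captured by the following formula.

\[ {{\varnothing }}_{i}=\sum_{S\subseteq \{1,\dots ,n\}\setminus \{i\}}\frac{|S|!\,\left(n-|S|-1\right)!}{n!}\left(p\left(S\cup \{i\}\right)-p\left(S\right)\right) \]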

where \({{\varnothing }}_{i}\) is the SHAP value of feature \(i\) , \(p\) is the prediction by the model, n is the number of features and S is any set of features that does not include the feature \(i\) . The specific model we used for the prediction was the random forest regressor where we target-encoded all features with the product of the mean and the median of the LoS, since most of the features were categorical.
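To make the definition concrete, the sketch below evaluates the standard Shapley formula exactly, i.e. the weighted sum over all subsets S of the difference between the prediction with and without feature i, with weight |S|!(n-|S|-1)!/n!. The set function `toy_predict` is an invented stand-in for the model prediction p, and the brute-force enumeration is for illustration only; libraries such as SHAP use far more efficient approximations for tree models.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, n):
    """Exact Shapley values for a set function `predict` over features 0..n-1."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):  # subset sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset S.
                phi += weight * (predict(s | {i}) - predict(s))
        values.append(phi)
    return values

def toy_predict(features):
    """Hypothetical model: additive effects plus an interaction of features 0 and 1."""
    out = 2.0 * (0 in features) + 1.0 * (1 in features)
    if 0 in features and 1 in features:
        out += 1.0
    return out

phi = shapley_values(toy_predict, n=3)
print(phi)
```

The interaction term is split equally between features 0 and 1, feature 2 receives zero, and the values sum to the difference between the prediction with all features and the prediction with none.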

Classification models

One approach to the problem is to bin the LoS into different classes, and train a classifier to predict which class an input sample falls in. We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs.  3 and  4 .
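The binning can be expressed with `pandas.cut`, as in the following sketch. The bin edges reproduce the five classes listed above; the label strings and the toy LoS values are chosen here for illustration.

```python
import pandas as pd

# Invented LoS values in days, including the 120-day truncation value.
los = pd.Series([1, 2, 3, 4, 5, 6, 7, 30, 120])

# Right-closed intervals: (0,1], (1,2], (2,3], (3,6], (6,inf)
bins = [0, 1, 2, 3, 6, float("inf")]
labels = ["1 day", "2 days", "3 days", "4-6 days", ">6 days"]

los_class = pd.cut(los, bins=bins, labels=labels, right=True)
print(los_class.value_counts(sort=False))
```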

figure 4

A density plot of the distribution of the length of stay. The area under the curve is 1. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

We used three different classification models, comprising the following:

Multinomial Logistic Regression

Random Forest Classifier

CatBoost classifier [ 62 ].

We used a Multinomial Logistic Regression model [ 59 ] trained and tested using tenfold cross validation to classify the LoS into one of the bins. The multinomial logistic regression model is capable of providing explainable results, which is part of the requirements. We used the feature engineering techniques described in the previous section.

We used a Random Forest Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins. We used a maximum depth of 10 so as to get explainable insights into the model.

Finally, we used a CatBoost Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins.

Regression models

We used three different regression models with the feature engineering techniques mentioned above ( Feature encoding section). These comprise:

Linear regression

Catboost regression

Random forest regression

The linear regression was implemented using the nn.Linear() function in the open source library PyTorch [ 63 ]. We used the ‘Adam’ optimization algorithm [ 64 ] in mini-batch settings to train the model weights for linear regression.

We investigated CatBoost regression in order to create models with minimal feature sets, whereby models with a low number of input features would provide adequate results. Accordingly, we trained a CatBoost Regressor [ 65 ] in order to determine the relationship between combinations of features and the prediction accuracy as determined by the \(R^{2}\) correlation score.

The random forest regression was implemented using the function RandomForestRegressor() in scikit learn [ 55 ].

Model performance measures

For the regression models, we used the following metrics to compare the model performance.

The \(R^{2}\) score and the p-value. We use a significance level of α = 0.05 (5%) for our statistical tests. If the p-value is less than α = 0.05, then the \(R^{2}\) score is statistically significant.

For classifier models, we used the following metrics to compare the model performance.

True positive rate, false negative rate, and F1 score [ 66 ].

We computed the Brier score using Brier’s original calculation in his paper [ 67 ]. In this formulation, for R classes the Brier score B can vary between 0 and R, with 0 being the best score possible.

\[ B=\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{R}{\left({\widehat{y}}_{i,c}-{I}_{i,c}\right)}^{2} \]

where N is the number of samples, \({\widehat{y}}_{i,c}\) is the class probability as per the model and \({I}_{i,c}=1\) if the i th sample belongs to class c and \({I}_{i,c}=0\) if it does not belong to class c .
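A short numerical sketch of this score follows. It assumes the usual multi-class form of Brier's definition, namely the squared difference between the predicted class probability and the class indicator, summed over classes and averaged over samples. The function name and the toy probabilities are invented for illustration.

```python
import numpy as np

def brier_score_multiclass(probs, labels):
    """Mean over samples of the summed squared error between
    predicted class probabilities and one-hot class indicators."""
    probs = np.asarray(probs, dtype=float)
    n_samples, _ = probs.shape
    indicator = np.zeros_like(probs)
    indicator[np.arange(n_samples), labels] = 1.0
    return float(np.mean(np.sum((probs - indicator) ** 2, axis=1)))

# Three samples, three classes; the second prediction is split between two classes.
probs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 1, 2]

score = brier_score_multiclass(probs, labels)
print(score)
```

The two confident and correct predictions contribute zero, and the split prediction contributes 0.25 + 0.25 = 0.5, so the score is 0.5 / 3.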

We used the Delong test [ 68 ] to compare the AUC for different classifiers.

These metrics will allow other researchers to replicate our study and provide benchmarks for future improvements.

In this section we present the results of applying the techniques in the Methods section.

Descriptive statistics

We provide descriptive statistics that help the reader understand the distributions of the variables of interest.

Table 1 summarizes basic statistical properties of the LoS variable.

Figure  5 shows the distribution of the LoS variable for newborns.

figure 5

This figure depicts the distribution of the LoS variable for newborns

Table 2 shows the top 20 APR DRG descriptions based on their frequency of occurrence in the dataset.

Figure  6 shows the distribution of the LoS variable for the top 20 most frequently occurring APR DRG descriptions shown in Table  2 .

figure 6

A 3-d plot showing the distribution of the LoS for the top-20 most frequently occurring APR DRG descriptions. The x-axis (horizontal) depicts the LoS, the y-axis shows the APR DRG codes and the z-axis shows the density or frequency of occurrence of the LoS

We experimented with different encoding schemes for the categorical variables and for each encoding we examined different regression techniques. Our results are shown in Table 3 . We experimented with the three encoding schemes shown in the first column. The last row in the table shows a combination of one-hot encoding and target encoding, where the number of columns in the dataset are increased to accommodate one-hot encoded feature values for categorical variables.

Feature importance, selection and feature engineering

We obtained the SHAP plots using a Random Forest Regressor trained with target-encoded features.

Figures  7  and 8 show the SHAP values plots obtained for the features in the newborn partition of the dataset. We find that the features, “APR DRG Code”, “APR Severity of Illness Code”, “Patient Disposition”, “CCS Procedure Code”, are very useful in predicting the LoS. For instance, high feature values for “APR Severity of Illness Code”, which are encoded by red dots have higher SHAP values than the blue dots, which correspond to low feature values.

figure 7

SHAP Value plot for newborns

figure 8

1-D SHAP plot, in order of decreasing feature importance: top to bottom (for non-newborns)

A similar interpretation can be applied to the features in the non-newborn partition of the dataset. We note that “Operating Certificate Number” is among the top-10 most important features in both the newborn and non-newborn partitions. This finding is discussed in the Discussion section.

From Fig.  9 , we observe that as the severity of illness code increases from 1–4, there is a corresponding increase in the SHAP values.

figure 9

A 2-D plot showing the relationship between SHAP values for one feature, “APR Severity of Illness Code”, and the feature values themselves (non-newborns)

To further understand the relationship between the APR Severity of Illness code and the LoS, we created the plot in Fig.  10 . This shows that the most frequently occurring APR Severity of Illness code is 1 (Minor), and that the most frequently occurring LoS is 2 days. We provide this 2-D projection of the overall distribution of the multi-dimensional data as a way of understanding the relationship between the input features and the target variable, LoS.

figure 10

A density plot showing the relationship between APR Severity of Illness Code and the LoS. The color scale on the right determines the interpretation of colors in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

Similarly, Fig.  11 shows the relationship between the birth weight and the length of stay. The most common length of stay is two days.

figure 11

A density plot showing the distribution of the birth weight values (in grams) versus the LoS. The colorbar on the right shows the interpretation of color values shown in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

Classification

We obtained a classification accuracy of 46.98% using Multinomial Logistic Regression with tenfold cross-validation in the 5-class classification task for non-newborn cases. The confusion matrix in Fig. 12 shows that the highest density of correctly classified samples is in or close to the diagonal region. The regions where our model fails occur between adjacent classes, as can be inferred from the confusion matrix.

figure 12

Confusion matrix for classification of non-newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

For the newborn cases, we obtained a classification accuracy of 60.08% using a Random Forest Classification model with tenfold cross-validation in the 5-class classification task. The confusion matrix in Fig. 13 shows that the majority of data samples lie in or close to the diagonal region. The regions where our model does not do well occur between adjacent classes, as can be inferred from the confusion matrix.

figure 13

Confusion matrix for classification of newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

The density plot in Fig.  14 shows the relationship between the actual LoS and the predicted LoS. For a LoS of 2 days, the centroid of the predicted LoS cluster is between 2 and 3 days.

figure 14

Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

A quantitative depiction of our model errors is shown in Fig. 15. The values in Fig. 15 are interpreted as follows. Referring to the column for LoS = 2, the top row shows that 51% of the predicted LoS values for an actual stay of 2 days are also 2 days (zero error), that 23% of the predicted values for LoS equal to 2 days have an error of 1 day, and so on. The relatively high values in the top row indicate that the model is performing well, with an error of less than 1 day. There are relatively few instances of errors between 2 and 3 days (typically less than 10% of the values show up in this row). The only exception is for the class corresponding to LoS greater than 8 days. The truncation of the data to produce this class results in larger model errors specifically for this class.
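A table of this kind can be produced by cross-tabulating the absolute prediction error against the actual class and normalizing each column. The sketch below uses invented predictions and integer error levels; it illustrates the idea and does not reproduce the exact row definitions of Fig. 15.

```python
import numpy as np
import pandas as pd

# Invented actual and predicted LoS classes (in days).
actual = np.array([1, 1, 2, 2, 2, 2, 3, 3])
predicted = np.array([1, 2, 2, 2, 3, 4, 3, 5])

abs_error = np.abs(predicted - actual)

# Rows: absolute error in days. Columns: actual LoS.
# Each column is normalized so that its entries sum to 1.
table = pd.crosstab(abs_error, actual, normalize="columns")
print(table)
```

In this toy example, half of the stays with an actual LoS of 2 days are predicted exactly, a quarter are off by one day, and a quarter are off by two days.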

figure 15

Shows the distribution of correctly predicted LoS values for each class used in our model for non-newborns. Along the columns, we depict the different classes used in the model, consisting of LoS equal to 1, 2, 3 …8, and more than 8. Each row depicts different errors made in the prediction. For instance, the top row depicts an error of less than or equal to one day between the actual LoS and the predicted LoS. The second row from the top depicts an error which is greater than 1 and less than or equal to 2 days, and so on for the other rows

Figures  16 and 17 show the scatter plots for the linear regression models. The exact line represents a line with slope 1, and a perfect model would be one that produced all points lying on this line.

figure 16

Scatter plot showing an instance of a linear regression fit to the data (newborns). The \(R^{2}\) score is 0.82. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

figure 17

Scatter plot for linear regression (non-newborns). The \(R^{2}\) score is 0.42. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

Figure  18 shows a density plot depicting the relationship between the predicted length of stay and the actual length of stay.

figure 18

Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot. The best fit regression line to our predictions is shown in green, whereas the blue line represents the ideal fit (line of slope 1, where actual LoS and predicted LoS are equal)

Most of the existing literature on LoS prediction is based on data for specific disease conditions such as cancer or cardiac disease. Hence, in order to understand which CCS diagnosis codes produce good model fits, we produced the plot in Fig. 19.

figure 19

This figure shows the three CCS diagnosis codes that produced the top three \(R^{2}\) scores using linear regression. These are 101, 100 and 109. The three CCS Diagnosis codes that produced the lowest \(R^{2}\) scores are 159, 657, and 659

We provide the following descriptions in Tables 4 and 5 for the 3 CCS Diagnosis Codes in Fig. 19 with the top \(R^{2}\) scores using linear regression.

Similarly, the following table shows the 3 CCS Diagnosis Codes in Fig. 19 for the lowest \(R^{2}\) scores using linear regression.

Models with minimal feature sets

We trained a CatBoost Regressor [ 65 ] on the complete dataset in order to determine the relationship between combinations of features and the prediction accuracy as determined by the \(R^{2}\) correlation score. This is shown in Fig. 20.

figure 20

The labels for each row on the left show combinations of different input features. A CatBoost regression model was developed using the selected combination of features. The \(R^{2}\) correlation score for each model is shown in the bar graph

We can infer from Fig. 20 that only four features (‘APR MDC Code’, ‘APR Severity of Illness Code’, ‘APR DRG Code’, ‘Patient Disposition’) are sufficient for the model to reach very close to its maximum performance. We obtain similar results when using other regression models for the same experiment.
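The experiment can be sketched as a loop over feature subsets, with a cross-validated score recorded for each subset. In the sketch below, Scikit-learn's `RandomForestRegressor` is swapped in for the CatBoost Regressor purely to keep the example dependency-light; the feature names, synthetic data, fold count, and forest size are all invented for illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: three informative features and one pure-noise feature.
rng = np.random.default_rng(0)
n = 300
names = ["drg", "severity", "disposition", "noise"]
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.3, size=n)

# Cross-validated R^2 for every non-empty combination of features.
scores = {}
for k in range(1, len(names) + 1):
    for subset in combinations(range(len(names)), k):
        model = RandomForestRegressor(n_estimators=30, random_state=0)
        r2 = cross_val_score(model, X[:, list(subset)], y, cv=5, scoring="r2").mean()
        scores[tuple(names[j] for j in subset)] = r2

# Print the combinations from best to worst score.
for combo, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{r2:6.3f}  {combo}")
```

Sorting the subsets by score shows how few features are needed before the score stops improving, which is the kind of comparison summarized in Fig. 20.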

Classification trees

We used a random forest tree approach to generate the trees in Figs.  21 and 22 .

figure 21

A random forest tree that represents a best-fit model to the data for newborns. With 4 levels of the decision tree, the \(R^{2}\) score is 0.65

figure 22

A random forest tree using only a tree of depth 3 that represents a best-fit model to the data for non-newborns. The \(R^{2}\) score is 0.28. We can generate trees with greater depth that better fit the data, but we have shown only a depth of 3 for the sake of readability in the printed version of this paper. Otherwise, the tree would be too large to be legible on this page. The main point in this figure is to showcase the ease of interpretation of the working of the model through rules

We used tenfold cross validation to determine the regression scores. The results are summarized in Tables  6 and 7 .

We computed the multi-class classifier metrics for logistic regression, using one-hot encoding for non-newborns. The results are presented in Table 8. The first row represents the accuracy of the classifier when Class 0 is compared against the rest of the classes. A similar interpretation applies to the other rows in the table, i.e., one-versus-rest. The macro average gives the balanced recall and precision, and the resulting F1 score. The weighted average gives a support-weighted (number of samples) average of the individual class metrics. The overall accuracy is computed by dividing the number of accurate predictions, 49,686, by the total number of samples, 105,932, which yields a value of 0.47.
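The difference between the macro and weighted averages, and the accuracy arithmetic, can be illustrated with Scikit-learn as follows. The label vectors are invented; only the counts 49,686 and 105,932 come from the text.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented true and predicted class labels for a 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 2, 0]

# Macro: unweighted mean of per-class metrics.
macro = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
# Weighted: per-class metrics weighted by support (number of true samples).
weighted = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

# Overall accuracy reported in the text: correct predictions / total samples.
reported_accuracy = 49_686 / 105_932

print(macro[:3], weighted[:3], accuracy, round(reported_accuracy, 2))
```

The support-weighted recall always equals the overall accuracy, whereas the macro recall treats the small class on an equal footing with the large ones.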

For the category of non-newborns, Fig.  23  provides a graphical plot that visualizes the ROC curves for the different multiclass classifiers we developed.

figure 23

This figure applies to data concerning non-newborns. We show the multiclass ROC curves for the performance of the catboost classifier for the different classes shown. The area under the ROC curve is 0.7844

In Table  9 we compare the performance of our multiclass classifier using logistic regression developed on 2016 SPARCS data against 2017 SPARCS data.

In order to compare the performance of the different classifiers, we computed the AUC measures reported in Table 10. Figure 24 visualizes the data in Table 10 and Fig. 25 visualizes the data in Table 11. In Tables 12 and 13 we report the results of computing the Delong test for non-newborns and newborns respectively. In Tables 14 and 15 we report the results of computing the Brier scores for non-newborns and newborns respectively.

figure 24

A bar chart that depicts the data in Table  10 for non-newborns

figure 25

A bar chart that depicts the data in Table  11

Model parameters

In Table  16 we present the parameter and hyperparameter values used in the different models.

Additional results shown in the Appendix/Supplementary material

Due to space restrictions, we show additional results in the Appendix/Supplementary Material. These results are in tabular form and describe the \(R^{2}\) scores for different segmentations of the variables in the dataset, e.g. according to age group, severity of illness code, etc.

The most significant result we obtain is shown in Figs. 21 and 22, which provide an interpretable view of the working of the decision trees using random forest modeling. Figure 21 for newborns shows that the birth weight features prominently in the decision tree, occurring at the root node. Low birth weights are represented on the left side of the tree and are typically associated with longer hospital stays. Higher birth weights occur on the right side of the tree, and the node in the bottom row with 189,574 samples shows that the most frequently occurring predicted stay is 2.66 days. Figure 22 for non-newborns shows that the features “APR DRG Code”, “APR Severity of Illness Code” and “Patient Disposition” are the most important top-level features to predict the LoS. This provides a relatively simple rule-based model, which can be easily interpreted by healthcare providers as well as patients. For instance, the right-most branch of the tree classifies the input data into a relatively high LoS (46 days) when the APR DRG Code is greater than 813.55 and the APR Severity of Illness Code is less than 91.
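Rules of this kind can be read directly from a shallow tree. The sketch below fits a single Scikit-learn `DecisionTreeRegressor` of depth 3 on synthetic data and prints its rules with `export_text`. A single tree is used here in place of a tree drawn from a random forest, and the feature names and data are invented; the thresholds it prints bear no relation to those in Figs. 21 and 22.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in: LoS grows with a severity code and a disposition code.
rng = np.random.default_rng(0)
severity = rng.integers(1, 5, size=400)      # codes 1..4
disposition = rng.integers(0, 3, size=400)   # codes 0..2
los = 2.0 * severity + 1.0 * disposition + rng.normal(scale=0.5, size=400)
X = np.column_stack([severity, disposition])

# A depth limit keeps the rule set small enough to read.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, los)

# Plain-text rules: one line per split, leaves show the predicted value.
rules = export_text(tree, feature_names=["severity_code", "disposition_code"])
print(rules)
```

Each root-to-leaf path in the printed output is an if-then rule ending in a predicted LoS, which is the form of explanation discussed above.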

The results in Fig. 19 and Table 4 show that if we restrict our model to specific CCS Diagnosis descriptions such as “coronary atherosclerosis and other heart disease”, we obtain a good \(R^{2}\) score of 0.62. The objective of our work is not to cherry-pick CCS Diagnosis codes that produce good results, but rather to develop a single model for the entire SPARCS dataset to obtain a bird's-eye perspective. For future work, we can explicitly build separate models for each CCS Diagnosis code, and that could have relevance to specific medical specialties, such as cardiovascular care.

Similarly, the results in Fig.  19 and Table  5 show that there are CCS Diagnosis codes corresponding to schizophrenia and mood disorders that produce a poor model fit. Factors that contribute to this include the type of data in the SPARCS dataset, where information about patient vitals, medications, or a patient’s income level is not provided, and the inherent variability in treating schizophrenia and mood disorders. Baeza et al. [ 69 ] identified several variables that affect the LoS in psychiatric patients, which include psychiatric admissions in the previous years, psychiatric rating scale scores, history of attempted suicide, and not having sufficient income. Such variables are not provided in the SPARCS dataset. Hence a policy implication is to collect and make such data available, perhaps as a separate dataset focused on mental health issues, which have proven challenging to treat.

Figures  16 and 17 show that a better regression fit is obtained when a specific CCS Diagnosis code is used to build the model, such as “Newborn” in Fig.  16 . To put these results in context, we note that it is difficult to obtain a high R 2 value for healthcare datasets in general, and especially for large numbers of patient samples that span multiple hospitals. For instance, Bertsimas [ 70 ] reported an R 2 value of 0.2 and Kshirsagar [ 71 ] reported an R 2 value of 0.33 for similar types of prediction problems as studied in this paper.

Further details on the segmentation of R² scores by the different variable categories are shown in the Appendix/Supplementary Material section. For instance, the table corresponding to Age Groups shows that there is close agreement between the mean of the predicted LoS from our model and the actual LoS. Furthermore, the mean LoS increases steadily from 4.8 days for Age group 0–17 to 6.4 days for ages 70 or older. A discussion of these tables is outside the scope of this paper. However, they are provided to help other researchers form hypotheses for further investigations or to find supporting evidence for ongoing research.

Table 3 shows that the best encoding scheme is to combine target encoding with one-hot encoding and then apply linear regression. This produces an R² score of 0.42 for the non-newborn data, which is the best fit we could obtain. This table also shows that significant improvements can be obtained by exploring the search space, which consists of different feature encoding strategies and regression methods. There is no theoretical framework that determines the optimum choice, and the best method is to conduct an experimental search. An important contribution of the current paper is to explore this search space so that other researchers can use and build upon our methodology.
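The idea of combining target encoding with one-hot encoding can be sketched as follows. This is a simplified illustration under stated assumptions rather than our pipeline: the helper names, the choice of which column receives which encoding, the category labels, and the LoS values are all hypothetical. In practice, the per-category means used for target encoding should be computed on the training split only, to avoid leaking the target into the features.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


def one_hot_encode(categories):
    """Map each category to a 0/1 indicator vector over the sorted levels."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels


# Hypothetical records: a many-valued code column and a few-valued column
code_column = ["640", "640", "194", "194", "470"]
admission_type = ["Emergency", "Newborn", "Emergency", "Elective", "Elective"]
los_days = [2, 3, 5, 7, 3]

code_encoded, code_means = target_encode(code_column, los_days)
type_encoded, type_levels = one_hot_encode(admission_type)

# Final feature rows: one target-encoded value followed by the indicators
features = [[c] + row for c, row in zip(code_encoded, type_encoded)]
print(type_levels)
print(features[0])
```

The resulting feature rows could then be passed to any regression method; target encoding keeps the feature count small for columns with many distinct values, while one-hot encoding is practical for columns with only a few.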

The distribution of errors in Fig. 15 shows that the truncation we employed at a LoS of 8 days produces artifacts in the prediction model, as all stays longer than 8 days are lumped into one class. Nevertheless, the distribution of LoS values in Fig. 4 shows that a relatively small number of data samples have a LoS greater than 8 days. Investigating different truncation levels is outside the scope of the current paper and is left for future work. With our methodology, the truncation level can also be tuned by practitioners in the field, including hospital administrators and other researchers.
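The effect of the truncation level can be explored with a few lines of code. The sketch below is illustrative only: the function names and the sample LoS values are hypothetical, and the cap is exposed as a parameter so that different truncation levels can be compared.

```python
def truncate_los(los_days, cap=8):
    """Clip every stay longer than `cap` days down to `cap`."""
    return [min(days, cap) for days in los_days]


def fraction_above(los_days, cap=8):
    """Share of stays that the truncation at `cap` days would alter."""
    return sum(1 for days in los_days if days > cap) / len(los_days)


# Hypothetical LoS values in days, for illustration only
sample_los = [1, 2, 3, 12, 8, 30, 4, 9]
truncated = truncate_los(sample_los, cap=8)
affected = fraction_above(sample_los, cap=8)

# Compare how many records each candidate truncation level would affect
for candidate_cap in (8, 10, 14):
    print(candidate_cap, fraction_above(sample_los, cap=candidate_cap))
```

A practitioner could run such a comparison on their own data to judge how much information is lost at each candidate truncation level.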

Our results in Fig. 7 show that certain features are not useful in predicting the LoS. In particular, the SHAP plot shows that features such as race, gender, and ethnicity are not useful in predicting the LoS. Had this not been the case, it would have implied that there is systemic bias based on race, gender, or ethnicity. For instance, a person of a given race might have a shorter LoS because of their demographic identity. This would be unacceptable in the medical field. It is reassuring to see that a large and detailed healthcare dataset does not show evidence of such bias.

To place this finding in context, racial bias is an important area of research in the U.S., especially in fields such as criminology and access to financial services such as loans. In the U.S., it is well known that there is a disproportional imprisonment of black and Hispanic males [ 72 ]. Researchers working on criminal justice have determined that there is racial bias in the process of sentencing and granting parole, with blacks being adversely affected [ 73 ]. This bias is reinforced through any algorithms that are trained on the underlying data. There is evidence that banks discriminate against applicants for loans based on their race or gender [ 74 ].

This does not appear to be the case in our analysis of the SPARCS data. Though we did not specifically investigate the issue of racial bias in the LoS, the feature analysis we conducted automatically provides relevant answers. Other researchers, including those in the U.K. [ 21 ], have also determined that gender does not have an effect on LoS or costs. Hence the results in the current paper are consistent with the findings of other researchers in other countries working on entirely different datasets.

From Table 6 we see that, for the non-newborn data, catboost regression performs the best, with an R² score of 0.432. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through catboost regression is statistically significant. Similarly, the p-values for linear regression and random forest regression indicate that these models produce predictions that are statistically significant, i.e. they did not occur by random chance.

From Table 7, which refers to data from newborns, linear regression performs the best, with an R² score of 0.82. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through linear regression is statistically significant. Similarly, the p-values for random forest regression and catboost regression indicate that these models produce predictions that are statistically significant.

We examine the performance of classifiers on non-newborn data, as shown in Tables 10 and 12. The DeLong test reported in Table 12 shows that there is a statistically significant difference between the AUCs in each pairwise comparison of the models. Hence, we conclude that the catboost classifier performs the best, with an average AUC of 0.7844. We also note that there is a marginal improvement in performance when we use the catboost classifier instead of the random forest classifier. Both the catboost classifier and the random forest classifier perform better than logistic regression. We conclude that the best performing model for non-newborns is the catboost classifier, followed by the random forest classifier, and then logistic regression.

In the case of newborn data, we examine the performance of the classifiers as shown in Tables 11 and 13. From Table 13, we note that the p-values in all the rows are less than 0.05, except for the binary class “one vs. rest for class 3”, random forests vs. catboost. Hence, for this particular comparison between the random forest classifier and the catboost classifier for “one vs. rest for class 3”, we cannot conclude that there is a statistically significant difference between the performance of these two classifiers. From Table 11 we observe that the AUCs of these two classifiers are very similar. We also note that only about 10% of the dataset consists of newborn cases.
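The “one vs. rest” AUC used in these comparisons can be illustrated with a short sketch. It computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. The function names, class labels, and predicted probabilities below are hypothetical, and the sketch does not implement the DeLong test itself.

```python
def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(true_classes, class_probabilities, target_class):
    """Treat `target_class` as positive and every other class as negative."""
    labels = [1 if c == target_class else 0 for c in true_classes]
    scores = [row[target_class] for row in class_probabilities]
    return binary_auc(labels, scores)


# Hypothetical three-class example: true classes and predicted probabilities
true_classes = [0, 1, 2, 1, 0, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.2, 0.6],
]
auc_per_class = [one_vs_rest_auc(true_classes, probabilities, k) for k in range(3)]
print(auc_per_class)
```

The pairwise loop is quadratic in the number of samples and is meant only to make the definition explicit; for a dataset of the size considered here, a rank-based implementation from a standard library would be preferable.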

From Table 14 we note that the Brier score for the catboost classifier is the lowest. A lower Brier score indicates better performance. According to the Brier scores for the non-newborn data, the catboost classifier performs the best, followed by the random forest classifier and then logistic regression. Table 15 shows that for newborns, the random forest classifier performs the best, followed by the catboost classifier and logistic regression. The performance of the random forest classifier and that of the catboost classifier are very similar.
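To make the interpretation of the Brier score concrete, the sketch below computes it for a binary outcome as the mean squared difference between the predicted probability and the observed outcome. This is the common binary form and is shown as an assumption for illustration; the function name and the example values are hypothetical.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    if len(outcomes) != len(probabilities) or len(outcomes) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for o, p in zip(outcomes, probabilities)) / len(outcomes)


# Hypothetical outcomes and two sets of predicted probabilities
outcomes = [1, 0, 1, 0]
confident_model = [0.9, 0.1, 0.8, 0.2]
hesitant_model = [0.6, 0.4, 0.6, 0.4]

confident_score = brier_score(outcomes, confident_model)
hesitant_score = brier_score(outcomes, hesitant_model)
print(confident_score < hesitant_score)
```

Both hypothetical models rank the cases identically, yet the model whose probabilities are closer to the observed outcomes obtains the lower (better) Brier score, which is why the metric complements the AUC.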

From a practical perspective, it may make sense to use a catboost classifier on both newborn and non-newborn data as it simplifies the processing pipeline. The ultimate decision rests with the administrators and implementers of these decision systems in the hospital environment.

Burn et al. observe [ 21 ] that though the U.S. has reported similar declines in LoS as in the U.K., the overall costs of joint replacement have risen. The U.K. government created policies to encourage the formation of specialist centers for joint replacement, which have resulted in a reduction in the LoS as well as in costs. The results and analysis presented in our current paper can help educate patients and healthcare consumers about trends in healthcare costs and how they can be reduced. An informed and educated electorate can press their elected representatives to make changes to the healthcare system to benefit the populace.

Hachesu et al. examined the LoS for cardiac disease patients [ 22 ], using data from around 5000 patients and 35 input variables to build a predictive model. They found that the LoS was longer in patients with high blood pressure. In contrast, our method uses data from 2.3 million patients and considers multiple disease conditions simultaneously. We also do not have access to patient vitals such as blood pressure measurements, due to the limitations of the existing New York State SPARCS data.

Garcia et al. [ 23 ] conducted a study of elderly patients (age greater than 60) to understand factors governing the LoS for hip fracture treatment. They used 660 patient records and determined that the most significant variable was the American Society of Anesthesiologists (ASA) classification score. The ASA score ranges from 1 to 5 and captures the anesthesiologist’s impression of a patient’s health and comorbidities at the time of surgery. Garcia et al. showed a monotonically increasing relationship between the ASA score and the LoS. However, they did not build a specific predictive model. Their work shows that it is possible to find single variables with significant information content in order to estimate the LoS. The New York SPARCS dataset that we used does not contain the ASA score. Hence a policy implication of our research is to alert the healthcare authorities to include variables such as the ASA score, where relevant, in datasets released in the future. The additional storage required is very small (one additional byte per patient record).

Arjannikov et al. [ 25 ] developed predictive models by binarizing the data into two categories, e.g. LoS ≤ 2 days or LoS > 2 days. In our work, we did not employ such a discretization. In contrast, we used continuous regression techniques as well as classification into more than two bins. It is preferable to stay as close to the actual data as possible.
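The difference between binarization and classification into more than two bins can be illustrated as follows. The sketch is an assumption-laden example: the bin edges of 2, 4, and 8 days and the function names are hypothetical and are not the bins used in our experiments.

```python
from bisect import bisect_left


def binarize_los(days, threshold=2):
    """Two-class label: 0 if LoS <= threshold, otherwise 1."""
    return 0 if days <= threshold else 1


def bin_los(days, upper_edges=(2, 4, 8)):
    """Multi-class label: index of the first bin whose upper edge is >= days.

    With edges (2, 4, 8) the classes are: 0 for LoS <= 2, 1 for 2 < LoS <= 4,
    2 for 4 < LoS <= 8, and 3 for LoS > 8.
    """
    return bisect_left(upper_edges, days)


# Hypothetical LoS values in days, for illustration only
sample_los = [1, 2, 3, 4, 5, 8, 9, 21]
binary_labels = [binarize_los(d) for d in sample_los]
multi_labels = [bin_los(d) for d in sample_los]
print(binary_labels)
print(multi_labels)
```

In the two-class scheme a 3-day stay and a 21-day stay receive the same label, whereas the multi-bin scheme keeps them apart, which is closer to the underlying data.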

Almashrafi et al. [ 27 ] and Cots et al. [ 75 ] observed that larger hospitals tended to have longer LoS for patients undergoing cardiac surgery. Though we did not specifically examine cardiac surgery outcomes, our feature analysis indicated that the hospital operating certificate number had lower relevance than other features such as DRG codes. Nevertheless, the SHAP plots in Fig.  7 and Fig.  8 show that the hospital operating certificate number occurs within the top 10 features in order of SHAP values. We will investigate this relationship in more detail in future research, as it requires determining the size of the hospital from the operating certificate number and creating an appropriate machine-learning model. The Appendix contains results that show certain operating certificate numbers that produce a good model fit to the data.

A major focus of our research is on building interpretable and explainable models. Based on the principle of parsimony, it is preferable to utilize models which involve fewer features. This will provide simpler explanations to healthcare professionals as well as patients. We have shown through Fig.  20 that a model with five features performs just as well as a model with seven features. These features also make intuitive sense and the model’s operation can be understood by both patients and healthcare providers.

Patients in the U.S. increasingly have to pay for medical procedures out-of-pocket as insurance payments do not cover all the expenses, leading to unexpectedly large bills [ 76 ]. Many patients in the U.S. also do not possess health insurance, with the consequence that they are charged the highest prices [ 77 ]. Kullgren et al. observe that patients in the U.S. need to be discerning healthcare consumers [ 78 ], as they can optimize the value they receive from out-of-pocket spending. In addition to estimating the cost of medical procedures, patients will also benefit from estimating the expected duration of a procedure such as joint replacement. This will allow them to budget adequate time for their medical procedures. Patients and consumers will benefit from obtaining estimates from an unbiased open data source such as New York State SPARCS and the use of our model.

Other researchers have developed specific LoS models for particular health conditions, such as cardiac disease [ 22 ], hip replacement [ 21 ], cancer [ 26 ], or COVID-19 [ 24 ]. In addition, researchers typically assume a prior statistical distribution for the outcomes, such as a Weibull distribution [ 24 ]. However, we have not made any assumptions of specific prior statistical distributions, nor have we restricted our analysis to specific diseases. Consequently, our model and techniques should be more widely applicable, especially in the face of rapidly changing disease trajectories worldwide.

Our study is based exclusively on freely available open health data. Consequently, we cannot control the granularity of the data and must use the data as-is. We are unable to obtain more detailed patient information, such as physiological variables (e.g. blood pressure or heart rate variability) at the time of admission and during the stay. Hospitals, healthcare providers, and insurers have access to this data. However, there is no mandate for them to make this available to researchers outside their own organizations. Sometimes they sell de-identified data to interested parties such as pharmaceutical companies [ 79 ]. Due to the high costs involved in purchasing this data, researchers worldwide, especially in developing countries, are at a disadvantage in developing AI algorithms for healthcare.

There is growing recognition that medical researchers need to standardize data formats and tools used for their analysis, and share them openly. One such effort is the organization for Observational Health Data Sciences and Informatics (OHDSI) as described in [ 80 ].

Twitter has demonstrated an interesting path forward, where a small percentage of its data was made available freely to all users for non-commercial purposes through an API [ 81 ]. Recently, Twitter has made a larger proportion of its data available to qualified academic researchers [ 82 ]. In the future, the profit motives of companies need to be balanced with considerations for the greater public good. An advantage of using the Twitter model is that it spurs more academic research and allows universities to train students and the workforce of the future on real-world and relevant datasets.

In the U.S., a new law went into effect in January 2021 requiring hospitals to make pricing data available publicly. The premise is that having this data would provide better transparency into the working of the healthcare system in the U.S. and lead to cost efficiencies. However, most hospitals are not in compliance with this law [ 83 ]. Concerted efforts by government officials as well as pressure by the public will be necessary to achieve compliance. If the eventual release of such data is not accompanied by a corresponding interest shown by academicians, healthcare researchers, policymakers, and the public, it is likely that the very premise of the utility of this data will be called into question. Furthermore, merely dumping large quantities of data into the public domain is unlikely to benefit anyone. Hence research efforts such as the one presented in this paper will be valuable in demonstrating the utility of this data to all stakeholders.

Our machine-learning pipeline can easily be applied to new data that will be released periodically by New York SPARCS, and also to hospital pricing data [ 83 ]. Due to our open-source methodology, other researchers can easily extend our work and apply it to extract meaning from open health data. This improves reproducibility, which is an essential aspect of science. We will make our code available on GitHub to interested researchers for non-commercial purposes.

Limitations of our models

Our models are restricted to the data available through New York State SPARCS, which does not provide detailed information about patient vitals. More detailed physiological data is available through the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) framework [ 84 ], though for a smaller number of patients. We plan to extend our methodology to handle such data in the future. Another limitation of our study is that it does not account for patient co-morbidities. This arises from the de-identification process used to release the SPARCS data, where patient information is removed. Hence we are unable to analyze multiple hospital admissions for a given patient, possibly for different conditions. The main advantage of our approach is that it uses large-scale population data (2.3 million patients) but at a coarse level of granularity, where physiological data is not available. Nevertheless, our approach provides a high-level view of the operation of the healthcare system, which provides valuable insights.

There is growing interest in using data analytics to increase government transparency and inform policymaking. It is expected that the meaning and insights gained from such evidence-based analysis will translate to better policies and optimal usage of the available infrastructure. This requires cooperation between computer scientists, domain experts, and policy makers. Open healthcare data is especially valuable in this context due to its economic significance. This paper presents an open-source analytics system to conduct evidence-based analysis on openly available healthcare data.

The goal is to develop interpretable machine learning models that identify key drivers and make accurate predictions related to healthcare costs and utilization. Such models can provide actionable insights to guide healthcare administrators and policy makers. A specific illustration is provided via a robust machine learning pipeline that predicts hospital length of stay across 285 disease categories based on 2.3 million de-identified patient records. The length of stay is directly related to costs.

We focused on the interpretability and explainability of input features and the resulting models. Hence, we developed separate models for newborns and non-newborns, given differences in input features. The best performing model for non-newborn data was catboost regression, which achieved an R² score of 0.43. The best performing model for newborns was linear regression, which achieved an R² score of 0.82. Key newborn predictors included birth weight, while non-newborn models relied heavily on the diagnostic related group classification. This demonstrates model interpretability, which is important for adoption. There is an opportunity to further improve performance for specific diseases. If we restrict our analysis to cardiovascular disease, we obtain an improved R² score of 0.62.

The presented approach has several desirable qualities. Firstly, transparency and reproducibility are enabled through the open-source methodology. Secondly, the model generalizability facilitates insights across numerous disease states. Thirdly, the technical framework can easily integrate new data while allowing modular extensions by the research community. Lastly, the evidence generated can readily inform multiple key stakeholders including healthcare administrators planning capacity, policy makers optimizing delivery, and patients making medical decisions.

Availability of data and materials

Data is publicly available at the website mentioned in the paper, https://www.health.ny.gov/statistics/sparcs/

There is an “About Us” tab in the website which contains all the contact details. The authors have nothing to do with this website as it is maintained by New York State.

Gurría A. Openness and Transparency - Pillars for Democracy, Trust and Progress. OECD.org. Available: https://www.oecd.org/unitedstates/opennessandtransparency-pillarsfordemocracytrustandprogress.htm . Accessed 28 June 2024.

Jetzek T. The Sustainable Value of Open Government Data: Uncovering the Generative Mechanisms of Open Data through a Mixed Methods Approach. Copenhagen Business School, Institut for IT-Ledelse Department of IT Management. 2015.

Move fast and heal things: How health care is turning into a consumer product. The Economist. 2022.  https://www.economist.com/business/how-health-care-is-turning-into-a-consumer-product/21807114 . Accessed 28 June 2024.

New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS).  https://www.health.ny.gov/statistics/sparcs/ . Accessed 5 Oct 2022.

Rao AR, Chhabra A, Das R, Ruhil V. A framework for analyzing publicly available healthcare data. In 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom). 2015: IEEE, pp. 653–656.

Rao AR, Clarke D. A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications. In IEEE International Conference on Healthcare Informatics ICHI, Chicago. 2016: IEEE, pp. 255–261.

Rao AR, Garai S, Dey S, Peng H. PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data. SN Computer Science. 2021;2(6):1–22.

Rao AR, Rao S, Chhabra R. Rising mental health incidence among adolescents in Westchester, NY. Community Ment Health J. 2021:1–1. 

Boylan J F. My $145,000 Surprise Medical Bill. New York Times. 2020.  https://www.nytimes.com/2020/02/19/opinion/surprise-medical-bill.html . Accessed 28 June 2024.

Peterson K, Bykowicz J. Congress Debates Push to End Surprise Medical Billing. Wall Street J. 2020.  https://www.wsj.com/articles/congress-debates-push-to-end-surprise-medical-billing-11589448603 . Accessed 28 June 2024.

Wang S, Zhang J, Fu Y, Li Y. ACM TIST Special Issue on Deep Learning for Spatio-Temporal Data: Part 1. 12th ed. NY: ACM New York; 2021. p. 1–3.

Jones R. Declining length of stay and future bed numbers. BJHCM. 2015;21(9):440–1.

Daghistani TA, Elshawi R, Sakr S, Ahmed AM, Al-Thwayee A, Al-Mallah MH. Predictors of in-hospital length of stay among cardiac patients: a machine learning approach. Int J Cardiol. 2019;288:140–7.

Sen-Crowe B, Sutherland M, McKenney M, Elkbuli A. A closer look into global hospital beds capacity and resource shortages during the COVID-19 pandemic. J Surg Res. 2021;260:56–63.

Stone K, Zwiggelaar R, Jones P, Mac Parthaláin N. A systematic review of the prediction of hospital length of stay: Towards a unified framework. PLOS Digital Health. 2022;1(4):e0000017.

Lequertier V, Wang T, Fondrevelle J, Augusto V, Duclos A. Hospital length of stay prediction methods: a systematic review. Med Care. 2021;59(10):929–38.

Sridhar S, Whitaker B, Mouat-Hunter A, McCrory B. Predicting Length of Stay using machine learning for total joint replacements performed at a rural community hospital. PLoS ONE. 2022;17(11):e0277479.

CCS (Clinical Classifications Software) - Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CCS/index.html . Accessed 13 Jan 2022.

Sotoodeh M, Ho JC. Improving length of stay prediction using a hidden Markov model. AMIA Summits on Translational Science Proceedings. 2019;2019:425.

Ma F, Yu L, Ye L, Yao DD, Zhuang W. Length-of-stay prediction for pediatric patients with respiratory diseases using decision tree methods. IEEE J Biomed Health Inform. 2020;24(9):2651–62.

Burn E, et al. Trends and determinants of length of stay and hospital reimbursement following knee and hip replacement: evidence from linked primary care and NHS hospital records from 1997 to 2014. BMJ Open. 2018;8(1):e019146.

Hachesu PR, Ahmadi M, Alizadeh S, Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare informatics research. 2013;19(2):121–9.

Garcia AE, et al. Patient variables which may predict length of stay and hospital costs in elderly patients with hip fracture. J Orthop Trauma. 2012;26(11):620–3.

Vekaria B, et al. Hospital length of stay for COVID-19 patients: Data-driven methods for forward planning. BMC Infect Dis. 2021;21(1):1–15.

Arjannikov T, Tzanetakis G. An empirical investigation of PU learning for predicting length of stay. In 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). 2021: IEEE, pp. 41–47.

Gupta D, Vashi PG, Lammersfeld CA, Braun DP. Role of nutritional status in predicting the length of stay in cancer: a systematic review of the epidemiological literature. Ann Nutr Metab. 2011;59(2–4):96–106.

Almashrafi A, Elmontsri M, Aylin P. Systematic review of factors influencing length of stay in ICU after adult cardiac surgery. BMC Health Serv Res. 2016;16(1):318.

Kalgotra P, Sharda R. When will I get out of the hospital? Modeling Length of Stay using Comorbidity Networks. J Manag Inf Syst. 2021;38(4):1150–84.

Awad A, Bader-El-Den M, McNicholas J. Patient length of stay and mortality prediction: a survey. Health Serv Manage Res. 2017;30(2):105–20.

Editorial-Board. The Lancet, HCL and Trump. Wall Street J. 2020.  https://www.wsj.com/articles/the-lancet-hcl-and-trump-11591226880 . Accessed 28 June 2024.

Servick K, Enserink M. A mysterious company’s coronavirus papers in top medical journals may be unraveling. Science. 2020. https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-top-medical-journals-may-be-unraveling . Accessed 28 June 2024.

Gabler E, Rabin RC. The Doctor Behind the Disputed Covid Data. New York Times. 2020.  https://www.nytimes.com/2020/07/27/science/coronavirus-retracted-studies-data.html . Accessed 28 June 2024.

Lancet-Editors. Expression of concern: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. 2020;395:10240. https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-topmedical-journals-may-be-unraveling . Accessed 28 June 2024.

Editorial-Board. Expression of Concern: Mehra MR et al. Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. 2020.  https://www.nejm.org/doi/full/10.1056/NEJMoa2007621 . Accessed 28 June 2024.

Hopkins JS, Gold R. Authors Retract Studies That Found Risks of Using Antimalaria Drugs Against Covid-19. Wall Street J. 2020. https://www.wsj.com/articles/authors-retract-study-that-found-risks-of-using-antimalaria-drug-against-covid-19-11591299329 . Accessed 28 June 2024.

https://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736(20)31180-6.pdf . Accessed 9 Jan 2022.

Wolfensberger M, Wrigley A. Trust in Medicine. Cambridge University Press. 2019. ISBN-13: 978-1108487191.

Bhattacharya J, Nicholson T. A Deceptive Covid Study, Unmasked. Wall Street J. 2022. https://www.wsj.com/articles/deceptive-covid-study-unmasked-abc-misleading-omicron-north-carolina-students-duke-mask-test-to-stay-11641933613 . Accessed 28 June 2024.

Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.

Begley CG, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116–26.

Eisner D. Reproducibility of science: Fraud, impact factors and carelessness. J Mol Cell Cardiol. 2018;114:364–8.

Wang F, Kaushal R, Khullar D. Should health care demand interpretable artificial intelligence or accept “black box” medicine? Am College Phys. 2020;172:59–60.

Reyes M, et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Art Intell. 2020;2(3):e190043.

Savadjiev P, et al. Demystification of AI-driven medical image interpretation: past, present and future. Eur Radiol. 2019;29(3):1616–24.

McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc. 2012.

Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Machine Learn Res. 2011;12:2825–30.

Cass S. The top programming languages: Our latest rankings put Python on top-again-[Careers]. IEEE Spectr. 2020;57(8):22–22.

Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE Transactions on Neural Networks and Learning Systems. 2020.

https://www.health.ny.gov/statistics/sparcs/docs/sparcs_data_dictionary.xlsx . Accessed 28 June 2024.

Design and development of the Diagnosis Related Group (DRG). https://www.cms.gov/icd10m/version37-fullcode-cms/fullcode_cms/Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf . Accessed 5 Oct 2022.

ARTICLE 28, Hospitals, Public Health (PBH) CHAPTER 45. 2023. Available: https://www.nysenate.gov/legislation/laws/PBH/A28 . Accessed 28 June 2024.

Gilmore‐Bykovskyi A, et al. Disparities in 30‐day readmission rates among Medicare enrollees with dementia. J Am Geriatr Soc. 2023.

Rodríguez P, Bautista MA, Gonzalez J, Escalera S. Beyond one-hot encoding: Lower dimensional target embedding. Image Vis Comput. 2018;75:21–31.

Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis. 6th ed. John Wiley & Sons; 2021. ISBN-13 978-1119578727.

Random forest regressor in sklearn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html . Accessed 28 June 2024.

Breiman L. Random forests. Mach Learn. 2001;45:5–32.

Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.

Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.

Böhning D. Multinomial logistic regression algorithm. Ann Inst Stat Math. 1992;44(1):197–200.

Vaid A, et al. Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation. J Med Internet Res. 2020;22(11):e24018.

Density Estimation.  https://scikit-learn.org/stable/modules/density.html . Accessed 5 Oct 2022.

CatBoost, a high-performance open source library for gradient boosting on decision trees. Available:  https://catboost.ai/  and https://catboost.ai/en/docs/concepts/python-usages-examples . Accessed 28 June 2024.

PyTorch documentation for torch.nn, the basic building blocks for graphs. Available: https://pytorch.org/docs/stable/nn.html . Accessed 28 June 2024.

Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. arXiv preprint arXiv:1706.09516. 2017.

Tharwat A. Classification assessment methods. Applied computing and informatics. 2020;17(1):168–92.

Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–45.

Baeza FL, da Rocha NS, Fleck MP. Predictors of length of stay in an acute psychiatric inpatient facility in a general hospital: a prospective study. Brazilian Journal of Psychiatry. 2017;40:89–96.

Bertsimas D, et al. Algorithmic prediction of health-care costs. Oper Res. 2008;56(6):1382–92.

Kshirsagar R. Accurate and Interpretable Machine Learning for Transparent Pricing of Health Insurance Plans. Presented at the AAAI 2021 Conference. 2021.

Ulmer J, Painter-Davis N, Tinik L. Disproportional imprisonment of Black and Hispanic males: Sentencing discretion, processing outcomes, and policy structures. Justice Q. 2016;33(4):642–81.

Angwin J, Larson J, Mattu S, Kirchner L. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. 2016.

Steil JP, Albright L, Rugh JS, Massey DS. The social structure of mortgage discrimination. Hous Stud. 2018;33(5):759–76.

Cots F, Mercadé L, Castells X, Salvador X. Relationship between hospital structural level and length of stay outliers: Implications for hospital payment systems. Health Policy. 2004;68(2):159–68.

Evans M, McGinty T. Hospital Prices Are Arbitrary. Just Look at the Kingsburys’ $100,000 Bill. Wall Street J. 2021.  https://www.wsj.com/articles/hospital-prices-arbitrary-healthcare-medical-bills-insurance-11635428943 . Accessed 28 June 2024.

Evans M. Hospitals Often Charge Uninsured People the Highest Prices, New Data Show. Wall Street J. 2021. https://www.wsj.com/articles/hospitals-often-charge-uninsured-people-the-highest-prices-new-data-show-11625584448 . Accessed 28 June 2024.

Kullgren JT, et al. A survey of Americans with high-deductible health plans identifies opportunities to enhance consumer behaviors. Health Aff. 2019;38(3):416–24.

Wetsman N. Hospitals are selling treasure troves of medical data — what could go wrong? The Verge. 2021. Available: https://www.theverge.com/2021/6/23/22547397/medical-records-health-data-hospitals-research . Accessed 28 June 2024.

Hripcsak G, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–8.

Gabarron E, Dorronzoro E, Rivera-Romero O, Wynn R. Diabetes on Twitter: a sentiment analysis. J Diabetes Sci Technol. 2019;13(3):439–44.

Statt N. Twitter is opening up its full tweet archive to academic researchers for free. The Verge. 2021. Available: https://www.theverge.com/2021/1/26/22250203/twitter-academic-research-public-tweet-archive-free-access . Accessed 28 June 2024. 

Evans M, Mathews AW, McGinty T. Hospitals Still Not Fully Complying With Federal Price-Disclosure Rules. Wall Street J. 2021.  https://www.wsj.com/articles/hospital-price-public-biden-11640882507 .

Johnson AE, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3(1):1–9.

Acknowledgements

We are grateful to the New York State SPARCS program for making the data available freely to the public. We greatly appreciate the feedback provided by the anonymous reviewers which helped in improving the quality of this manuscript.

No external funding was available for this research.

Author information

Authors and affiliations.

Indian Institute of Technology, Delhi, India

Raunak Jain, Mrityunjai Singh & Rahul Garg

Fairleigh Dickinson University, Teaneck, NJ, USA

A. Ravishankar Rao

Contributions

Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg contributed equally to all stages of preparation of the manuscript.

Corresponding author

Correspondence to A. Ravishankar Rao .

Ethics declarations

Ethics approval and consent to participate.

Not applicable as no human subjects were used in our study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Jain, R., Singh, M., Rao, A.R. et al. Predicting hospital length of stay using machine learning on a large open health dataset. BMC Health Serv Res 24 , 860 (2024). https://doi.org/10.1186/s12913-024-11238-y

Received : 19 June 2023

Accepted : 24 June 2024

Published : 29 July 2024

DOI : https://doi.org/10.1186/s12913-024-11238-y

  • Machine learning
  • Artificial intelligence
  • Health informatics
  • Open-source software
  • Healthcare analytics

BMC Health Services Research

ISSN: 1472-6963

Development of sports medicine in the International Olympic Committee

Volume 58, Issue 15

  • http://orcid.org/0000-0001-8863-4574 Torbjørn Soligard 1 ,
  • http://orcid.org/0000-0002-6238-608X Kathrin Steffen 2 ,
  • http://orcid.org/0000-0002-7474-8842 Richard Budgett 1 ,
  • http://orcid.org/0000-0003-2294-921X Lars Engebretsen 1 , 2
  • 1 Medical and Scientific Department , International Olympic Committee , Lausanne , Switzerland
  • 2 Oslo Sports Trauma Research Center, Department of Sports Medicine , Norwegian School of Sport Sciences , Oslo , Norway
  • Correspondence to Dr Torbjørn Soligard; torbjorn.soligard{at}olympic.org

https://doi.org/10.1136/bjsports-2024-108201

  • Sports medicine

When the senior author (LE) started as Head of Scientific Activities of the Medical and Scientific Department in the International Olympic Committee (IOC) in October 2007, the Mission statement of this new IOC development was to:

Develop the IOC’s medical and scientific activities to position the IOC as the primary reference in sports medicine and sports science.

Increase the positioning of the IOC on the protection of athletes’ health by developing research and education in sports medicine.

Act as the interface between the scientific community on the one hand (universities, research institutes, scientific societies) and the sporting community on the other hand (National Olympic Committees (NOCs), International Federations (IFs)).

17 years later, many projects have been accomplished as part of this mission. This commentary summarises the various programmes and initiatives implemented by the IOC Medical and Scientific Department over nearly two decades.

Research: the main source of knowledge

Surveillance.

Understanding injury epidemiology is a prerequisite and the basis for injury prevention [1]. Thus, we introduced the first full-fledged injury surveillance in the Beijing 2008 Summer Olympics [2]. This study was the first to collect all athlete injury data, not only from the Organising Committee’s polyclinic and medical stations but also directly from the NOCs’ medical staff, to present a complete picture of the epidemiology of injuries occurring at the Olympics. Two years later, in the Vancouver 2010 Winter Olympics, we broadened its scope to also include all athlete illnesses [3], and since then the surveillance studies have been an inherent part of both the Olympic Games [4–9] and the Youth Olympic Games [10–13]. The two latest papers from Tokyo 2020 [8] and Beijing 2022 [9] shed light on how COVID-19 and accompanying countermeasures impacted the athletes’ injuries and illnesses. Fundamentally, we consider this research to be the underpinning of much of our work and expect it to …

X @TSoligard, @larsengebretsen

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

IMAGES

  1. How to Download any Research Paper Free using DOI Number

  2. Steps to Download Research Papers For Free Using DOI

  3. Simple example of the CrossRef referral process. DOI, Digital Object

  4. What is a DOI? How would I find it and why would I use it?

VIDEO

  1. INDEXES

  2. How to Download Research Papers and Articles for Free?

  3. How to download research papers and books for free|How to download research papers from SCI-HUB free

  4. Best sites to find and download research papers for FREE. How to do literature search

  5. How to access and download paid research papers for free (all steps)?

  6. Using OpenVPN to access and download research papers from KKU network

COMMENTS

  1. Sci-Hub: knowledge to everyone

    Sci-Hub is the most controversial project in today's science. The goal of Sci-Hub is to provide free and unrestricted access to all scientific knowledge ever published in journal or book form. Today the circulation of knowledge in science is restricted by high prices. Many students and researchers cannot afford academic journals and books that are locked behind paywalls.

  2. Search by DOI or PMID

    If you find an article that has a PMID or a DOI and aren't sure if we have it you can use the Citation Linker or Libkey.io to search the library resources. If the library doesn't have it, you will be directed to Interlibrary Loan so you can request the article. Update 2022: Libkey has partnered with Retraction Watch to indicate retracted articles.

  3. OA.mg

    Free access to millions of research papers for everyone. OA.mg is a search engine for academic papers. Whether you are looking for a specific paper, or for research from a field, or all of an author's works - OA.mg is the place to find it. Universities and researchers funded by the public publish their research in papers, but where do we ...

  4. Find and Download Scientific Papers

    Welcome to , a dedicated platform for finding and downloading open access scientific papers and other research data. Our mission is to democratize access to scientific information, making it freely available to researchers, students, and curious minds across the globe. Open Access (OA) refers to the practice of providing unrestricted access via ...

  5. How to use Sci-hub to get academic papers for free

    If your country blocks the website, use one of the many free general purpose proxies. I tested hide.me for the purpose of writing this article and it works fine for Sci-hub using the Netherlands exit. 2. Go to the journal publisher's website. Go to the website of whatever article it is you are trying to get.

  6. Open Access Button

    Free, legal research articles delivered instantly or automatically requested from authors. Getting started on Safari: make sure your bookmarks bar is showing. If not, you can click View, and select "Show Bookmarks Bar." Drag the Button above to your bookmarks bar. ...

  7. Is there a simple way to bulk download a large number of papers from a

    [Scientific PDF download] RESP: Research Papers Search claims to search and download scientific papers. Yet to try it out. Articledownloader is worth exploring; PyPaperBot is well used for downloading scientific articles from DOI or academic database. I'm busy with a fork of Automated Search Helper. A research project by Lech Madeyski team at ...

  8. Best Websites To Download Research Papers For Free: Beyond Sci-Hub

    Unlike other websites to download research papers, Google Scholar provides free access to a vast collection of scholarly literature, making it one of the best websites to download research. ... You can easily search for research papers using: Keywords, DOI, or; Browsing through various open access journals featured on the site.

  9. How to use Sci-Hub

    Click the first link, and it should take you to a page with more information on that particular article. Copy the DOI number: Then, go to Sci-Hub and paste it into the search field: And voilà! It takes us straight to the PDF. To download the PDF just click the download button in the PDF preview or click the button on the left with the ...

  10. How to use Sci-Hub

    We type it into Google and we get this: Click on the first link and it should take you to a page with more information on that particular item. Copy the DOI number: Then go to Sci-Hub and paste it into the search field: That's all! It takes us directly to the PDF. To download the PDF, simply click the download button in the PDF preview or ...

  11. Sci-Hub: removing barriers in the way of science

    A research paper is a special publication written by scientists to be read by other researchers. Papers are primary sources necessary for research - for example, they contain detailed description of new results and experiments. papers in Sci-Hub library: more than 87,977,763 At this time the widest possible distribution of research papers, as ...

  12. Download pdf papers from a list of doi and a Scihub mirror

    Download PDF papers from a list of DOIs and a Sci-Hub mirror. MIT license. Latest release: version 0.0.1 (Jul 15, 2020). Language: Python.

  13. How to use SCI HUB to download research papers for free

    Follow the steps below to download paid research papers for free using Sci-Hub. Step 1: Go to the official website of Sci-Hub. Step 2: Enter the title, DOI, or URL of the research paper that you want to download or read using Sci-Hub. Step 3: Click on Open or press the Enter key. Step 4: As soon as you perform step 3, the desired research paper ...

  14. GitHub

    $ python3 sci_hub.py -h
    usage: sci_hub.py [-h] [--view] target

    Sci-Hub downloader: Utility to download from Sci-Hub

    positional arguments:
      target      URL/DOI to download PDF

    optional arguments:
      -h, --help  show this help message and exit
      --view      Open article in browser for reading

  15. scidownl · PyPI

    Download papers with DOI(s), PMID(s) or TITLE(s). Use option -d or --doi to download papers by DOI, option -p or --pmid to download papers by PMID, and option -t or --title to download papers by title. You can specify these options multiple times, and even mix them.

  16. Find an Article Using a DOI or PMID

    A DOI can take you directly to an online resource, but the Library does not always have access at a publisher site. The DOI lookup links to any online access we have. PMID is a unique identifier used in the PubMed database and can be used to look up abstracts in PubMed. The PMID lookup links to online access through the Library.

  17. Sci-Hub: Download Research Papers and Scientific Articles for free

    Just enter the DOI to download the papers you need for free from Sci-Hub. Sci-Hub was launched by the researcher Alexandra Elbakyan in 2011 with the goal of providing free access to research to everyone, not only those who have the money to pay for journals. Many in the scientific community praise Sci-Hub for furthering the knowledge ...

  18. Research Guides: DOI / PMID Search: Start Here

    If you have a DOI or PMID for an article that you would like to obtain using Purdue Libraries subscriptions or via Inter-Library loan services, simply copy and paste the DOI or PMID in the box above and click search. Examples to try (copy and paste these into the box above): DOI Examples: 10.1186/s12898-019-0263-7. 10.1016/j.seps.2021.101063.

  19. Web of Science: Digital Object Identifier (DOI) search

    DOIs can be searched from the basic or advanced search (field tag DO=). In Web of Science, it is not necessary to include a Boolean OR between DOIs when searching. You can simply copy and paste a list of DOIs into the search box. Depending on the number of special characters in the DOI, you can copy and paste up to 5000 DOIs.

  20. SciHub Downloader

    About this extension. The SciHub Downloader is a Mozilla Web Extension that allows you to conveniently download research papers from SciHub using their DOI (Digital Object Identifier) without leaving your current page. - Right-click on a DOI text or a DOI link to open the corresponding paper in SciHub.

  21. How to find a DOI [Update 2024]

    Usually, you will find it on the first page, either in the header or somewhere close to the title. Alternatively, you can also find it in the "About this article" or "Cite this article" sections. If the DOI isn't available, you can look it up on CrossRef.org by using the "Search Metadata" option. You just have to type in the source's ...

  22. GitHub

    Downloads pdfs via a DOI number, article title or a bibtex file, using the database of libgen(sci-hub) , arxiv - bibcure/scihub2pdf ... If you want to download files from scihub you will need to get PhantomJS. OSX ... $ scihub2pdf --title An useful paper Arxiv... $ scihub2pdf arxiv:0901.2686 $ scihub2pdf --title arxiv:Periodic table for ...

  23. How to download a full research paper using DOI number?

    To use this Sci hub alternative to download free research paper, you have to simply follow the below steps, Write the title of your research paper in the tweet Include the DOI or the full URL to ...

  24. Download Research Papers for Free: Legal and Ethical Methods

    14. PaperPanda - Download Research Papers for Free. PaperPanda is a Chrome extension that uses some clever logic and the Panda's detective skills to find you the research paper PDFs you need. Essentially, when you activate PaperPanda it finds the DOI of the paper from the current page, and then goes and searches for it.

  25. Twice-Yearly Lenacapavir or Daily F/TAF for HIV Prevention in Cisgender

    We thank the trial participants and communities, the investigators and site staff, the members of the Global Community Advisory and Accountability Group, and the members of the independent data ...

  26. Sharing research data for journal authors

    These brief, peer-reviewed articles complement full research papers and are an easy way to receive proper credit and recognition for the work you have done. Research elements are research outputs that have come about as a result of following the research cycle - this includes things like data, methods and protocols, software, hardware and more.

  27. Evening regular activity breaks extend subsequent free-living sleep

    Methods In this randomised crossover trial, participants each completed two 4-hour interventions commencing at approximately 17:00 hours: (1) prolonged sitting and (2) sitting interrupted with 3 min of bodyweight resistance exercise activity breaks every 30 min. On completion, participants returned to a free-living setting. This paper reports secondary outcomes relating to sleep quality and ...

  28. Predicting hospital length of stay using machine learning on a large

    Stone et al. [] present a survey of techniques used to predict the LoS, which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [] surveyed methods for LoS prediction. The main gap in the literature is that most methods focus on analyzing trends in the LoS or predicting the LoS only for specific conditions or ...

  29. Research: the main source of knowledge

    When the senior author (LE) started as Head of Scientific Activities of the Medical and Scientific Department in the International Olympic Committee (IOC) in October 2007, the Mission statement of this new IOC development was to: 17 years later, many projects have been accomplished as part of this mission. This commentary summarises the various programmes and initiatives implemented by the IOC ...

  30. Leading role of Saharan dust on tropical cyclone rainfall in the ...

    The predicted/observed mean Tropical Cyclone Rainrate (TCR) within 600 km of the TC center (R < 600): for (A) the non-DOD model and (B) the DOD model using the scatter density plot (out-of-sample predictions are made for five testing sets and then combined, then 100 bins with equal intervals are generated for the TCR ranges. The count of scatters is summarized within each box); (C) difference ...
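
Several of the entries above describe pasting a DOI into a lookup box, and the articles on this page show their DOIs both bare (10.1186/s12898-019-0263-7) and as full links (https://doi.org/10.1186/s12913-024-11238-y). As a minimal sketch of what such a lookup might do first, the Python below strips common prefixes and builds the public doi.org resolver link, which redirects to the publisher's landing page. The names `normalize_doi`, `doi_to_url`, and `DOI_PATTERN` are hypothetical, and the pattern is a loose heuristic assumed for illustration, not the full DOI syntax.

```python
import re
from urllib.parse import quote

# Loose heuristic (an assumption, not the official DOI grammar):
# "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that people commonly paste along with the DOI itself.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI, or raise ValueError if it does not look like one."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def doi_to_url(raw: str) -> str:
    """Build a doi.org resolver link, percent-encoding unusual characters."""
    return "https://doi.org/" + quote(normalize_doi(raw), safe="/")


if __name__ == "__main__":
    for example in ("10.1186/s12898-019-0263-7",
                    "https://doi.org/10.1186/s12913-024-11238-y"):
        print(doi_to_url(example))
```

Whether the resolved page offers free full text depends on the publisher and on the article's licence; the sketch only builds the link and does not download anything.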