Google Cloud has contributed to biomedical research in the search for antiviral treatments as the COVID-19 pandemic swept the world. The effort follows the release of the CORD-19 dataset on Kaggle, the world’s largest online data science community.

The dataset was published by the White House and a coalition of research groups. Through Kaggle, it makes freely available more than 150,000 scholarly articles, thousands of which are about COVID-19. In addition, Kaggle hosts millions of medical publications whose information can add to our knowledge of COVID-19 and other diseases.

This wealth of information is what drew the attention of global health practitioners and the medical community. It also raises a problem: how can they digest and analyse data, most of which is not readily consumed by machines? How can they study the portions that could be handled with natural language processing?

Fortunately, a group of data scientists known as Machine Learning Google Developer Experts (ML GDEs) stepped in. They applied their knowledge of big data and AI to surface new insights from the research literature faster. They created “AI versus COVID-19” (aiscovid19.org) with help from Google Cloud and the TensorFlow Research Cloud (TFRC).

Utilizing Google Cloud for biomedical research proceeded in several steps. The ML GDE team first studied biomedical researchers’ workflows, tools, and challenges, and how they use the medical literature. They then assembled a single dataset: a very large corpus of papers converted into machine-usable formats.

The second phase introduced the Biomedical Research Extensive Archive To Help Everyone (BREATHE), a large-scale biomedical database containing entries from top biomedical research repositories. BREATHE holds over 16 million English-language biomedical articles. The first version was released in June 2020, with new versions to follow. BREATHE is machine-readable and publicly accessible free of charge.

At the BREATHE stage, the team first verified content licensing, ensuring it complied with each source’s terms of use, and used APIs and FTP servers where available. The team followed an ethical-scraping philosophy when ingesting the public data.


The team examined more than 16 million articles from ten different sources, each with raw data in CSV, JSON, or XML and its own unique schema. It used Google Dataflow to process the data: every raw document was cleaned, normalized, and run through multiple heuristic rules to extract a final general schema, formatted as JSONL.
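The normalization step above can be sketched in plain Python. This is a minimal illustration, not the team’s actual Dataflow pipeline: the source records, field names, and simplified target schema (`id`, `title`, `abstract`, `source`) are hypothetical stand-ins for BREATHE’s real schema.

```python
import csv
import io
import json

def normalize(record, source):
    """Map a raw record with a source-specific schema onto a common schema."""
    # Heuristic field mapping: different repositories name fields differently.
    title = record.get("title") or record.get("article_title") or ""
    abstract = record.get("abstract") or record.get("summary") or ""
    doc_id = record.get("id") or record.get("doi") or ""
    return {
        "id": str(doc_id).strip(),
        "title": " ".join(title.split()),       # collapse stray whitespace
        "abstract": " ".join(abstract.split()),
        "source": source,
    }

def to_jsonl(records):
    """Serialize normalized records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Example inputs: one CSV source and one JSON source, each with its own schema.
csv_raw = "doi,article_title,summary\n10.1000/x1,A COVID-19 study,Findings...\n"
json_raw = '{"id": "pmc42", "title": "Another  study", "abstract": "Text."}'

rows = [normalize(r, "source_a") for r in csv.DictReader(io.StringIO(csv_raw))]
rows.append(normalize(json.loads(json_raw), "source_b"))
print(to_jsonl(rows))
```

The key idea is that each source keeps its own parser, while all records funnel through one `normalize` function into a single schema, which is what makes a unified, machine-usable corpus possible.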

After that, the team loaded the data into Google Cloud Storage buckets and Google BigQuery tables. Standard Structured Query Language (SQL) was then used to explore the contents of the data loaded into BigQuery.
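Because BigQuery speaks standard SQL, the kind of exploratory query involved can be illustrated locally with Python’s built-in `sqlite3`. Here sqlite is only a stand-in for BigQuery, and the `articles` table, its columns, and its rows are invented for the example.

```python
import sqlite3

# Local stand-in for a BigQuery table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT, title TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("a1", "COVID-19 transmission dynamics", "source_a"),
        ("a2", "Antiviral drug screening", "source_a"),
        ("a3", "Protein folding with ML", "source_b"),
    ],
)

# Standard-SQL exploration: how many articles did each source contribute?
counts = conn.execute(
    "SELECT source, COUNT(*) AS n FROM articles GROUP BY source ORDER BY n DESC"
).fetchall()
print(counts)  # [('source_a', 2), ('source_b', 1)]
```

The same `GROUP BY`/`COUNT` pattern works unchanged against a BigQuery table, only the connection layer differs.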

The next platform under the umbrella of Google Cloud for biomedical research is the Google Cloud Public Dataset Program, through which the team hopes other data scientists will access the dataset. The dataset is hosted in Google BigQuery and is included in BigQuery’s free tier, so anyone interested can query BREATHE with simple SQL commands.
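Querying a public dataset from the free tier might look like the following sketch using the `google-cloud-bigquery` client library. Note the assumptions: the table path `bigquery-public-data.breathe.nature` is illustrative only, so check the Public Dataset Program listing for the exact dataset and table names, and running `sample_breathe` requires the library installed plus Google Cloud credentials.

```python
def build_query(limit):
    """Build a simple standard-SQL query against an assumed BREATHE table."""
    return (
        "SELECT id, title "
        "FROM `bigquery-public-data.breathe.nature` "  # hypothetical table path
        f"LIMIT {int(limit)}"
    )

def sample_breathe(limit=5):
    """Fetch a few rows from the public BREATHE dataset via BigQuery.

    Requires `pip install google-cloud-bigquery` and configured Google
    Cloud credentials; queries within the free tier are not billed.
    """
    from google.cloud import bigquery  # imported lazily: optional dependency

    client = bigquery.Client()
    return [dict(row) for row in client.query(build_query(limit)).result()]

print(build_query(5))
```

Small `LIMIT` values like this keep exploratory queries well inside the free tier’s monthly quota.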

To create the BREATHE dataset, the team used Google Networking & Compute, Google Dataflow, Google BigQuery (BQ), Google Cloud Storage (GCS), the Google Cloud Public Dataset Program, Selenium, Google Colab, and Python 3. The BREATHE dataset can help everyone understand biomedical research data and mine it for useful information in the search for antiviral treatments and other solutions amid the COVID-19 pandemic.

Contact us to obtain solutions and more information about implementing cloud computing for your business.