  1. Motivation

    Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. For almost two decades, the TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles into a TimeTree of Life, delivered through a unique, user-friendly web interface. The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. We have therefore explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically.


    Results

    We have developed an optimized machine learning system to determine whether a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining the article excerpts surrounding mentions of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would find twice as many timetree-containing articles as have been discovered manually; including them in the TT database could double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution, recently published articles is 87%. This automation will speed up the collection and curation of timetrees at much lower human and time costs.
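The idea of classifying excerpts around figure mentions, and the F1 score used to report the results, can be illustrated with a minimal sketch. The abstract does not say how excerpts are delimited, so the fixed word window, the regular expression, and the function names `figure_excerpts` and `f1_score` are assumptions made for illustration only; they are not the TTF implementation.

```python
import re

# Assumption: a figure mention is a token such as "Fig.", "Figs", "Figure" or "figures".
FIGURE_TOKEN = re.compile(r"^fig(?:ure)?s?\.?$", re.IGNORECASE)


def figure_excerpts(text, window=5):
    """Return excerpts of up to `window` words on each side of a figure mention.

    Overlapping or touching windows are merged into a single excerpt.
    """
    words = text.split()
    spans = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "(Fig." and "Figure," still match.
        if FIGURE_TOKEN.match(word.strip("()[],;:")):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return [" ".join(words[s:e]) for s, e in spans]


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a pipeline of this shape, the excerpts rather than the whole article text would be passed to the classifier, which keeps the input short and focused on the passages most likely to describe a timetree figure.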

    Availability and implementation

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. This paper discusses how the risk of electricity grid outages is predicted using machine learning on historical data enhanced by graph embeddings of the distribution network. The process of graph creation using different embedding approaches is described. Several graph construction strategies are used to create a graph, which is then transformed into a form suitable for training ML algorithms. The impact of incorporating different graph embeddings on outage risk prediction is evaluated. Node2Vec is used to compute the graph embeddings, and a grid search is performed to find its optimal hyperparameters. The resulting accuracy metrics for a set of different hyperparameters are presented and compared against a baseline scenario in which no graph embeddings were used.
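The grid search over Node2Vec hyperparameters can be sketched as an exhaustive loop over parameter combinations. The abstract does not list the grid that was searched, so the parameter values below are hypothetical; the names `dimensions`, `walk_length`, `p` and `q` follow the usual Node2Vec parameters. The `placeholder_evaluate` function is a synthetic stand-in for the real step (train embeddings, append them to the historical features, fit the outage model, return its accuracy) and its scores are not results from the paper.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every hyperparameter combination and return the best one.

    Returns (best_params, best_score, all_results).
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# Hypothetical search space, chosen only to illustrate the procedure.
NODE2VEC_GRID = {
    "dimensions": [16, 32],
    "walk_length": [10, 20],
    "p": [0.5, 1.0],
    "q": [0.5, 1.0],
}


def placeholder_evaluate(params):
    """Synthetic score so the sketch runs; replace with real model training."""
    return (
        0.002 * params["dimensions"]
        + 0.001 * params["walk_length"]
        - 0.05 * abs(params["p"] - params["q"])
        - 0.01 * params["q"]
    )


best_params, best_score, results = grid_search(NODE2VEC_GRID, placeholder_evaluate)
```

Comparing `best_score` with the score of a model trained without any embedding features gives the kind of baseline comparison the abstract describes.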
  3. The project mission was to organize a workshop exploring how the US data science community can cooperate with, and benefit from collaborations with, partners in Serbia and the West Balkan region. The scope included fundamental data science methods and high-impact applications related to big data processing, security and privacy in critical infrastructures, biomedical informatics, and computational archeology. The workshop helped close the gap between data science research in the US and in Serbia and the region, and brought together data scientists with researchers from disciplines that until recently had little exposure to data science methods, potentially enabling collaborative breakthroughs in those scientific fields. A large fraction of participants from both sides were early-career researchers, including advanced graduate students, postdoctoral research associates, and assistant/associate professors within 10 years of obtaining their Ph.D. The participants included a large fraction of female and minority scientists. The workshop goal was pursued through the following interrelated objectives: (1) establishing new multidisciplinary international collaborations between data science, mathematics, and sciences that generate big data and require advanced methods; (2) reinforcing collaboration mechanisms between the NSF and Serbia's Ministry of Education, Science and Technological Development and organizing joint research projects; and (3) widening the impact of the workshop by involving researchers and stakeholders from the West Balkan region. The workshop consisted of four tracks, each co-chaired by three investigators from the US, Serbia and another West Balkan country.
Tangible outcomes from the workshop include a report describing the workshop activities for each of the four tracks and a proposal that recommends research collaboration areas of interest to all parties and identifies mechanisms and programs to facilitate collaboration.
  4. A novel method for real-time solar generation forecasting using weather data, which exploits both spatial and temporal structural dependencies, is proposed. The network observed over time is projected to a lower-dimensional representation, where a variety of weather measurements are used to train a structured regression model, while weather forecasts are used at the inference stage. Experiments were conducted at 288 locations in the San Antonio, TX area on data obtained from the National Solar Radiation Database. The model predicts solar irradiance with good accuracy (R² of 0.91 for the summer, 0.85 for the winter, and 0.89 for the global model). The best accuracy was obtained by the Random Forest Regressor. Multiple experiments were conducted to characterize the influence of missing data and different time horizons, providing evidence that the new algorithm is robust to data missing not only completely at random but also when the missingness mechanism is spatial and temporal.
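The accuracy measure and the three missing-data mechanisms mentioned above can be made concrete with a small sketch. It assumes observations are stored as a nested list indexed `grid[time_step][location]`, which is an illustrative choice and not the layout used in the paper; the function names are likewise hypothetical.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def mask_mcar(grid, fraction, rng):
    """Missing completely at random: drop each cell independently."""
    return [[None if rng.random() < fraction else v for v in row] for row in grid]


def mask_spatial(grid, locations):
    """Spatial mechanism: the chosen locations are missing at every time step."""
    return [[None if s in locations else v for s, v in enumerate(row)] for row in grid]


def mask_temporal(grid, time_steps):
    """Temporal mechanism: every location is missing at the chosen time steps."""
    return [[None if t in time_steps else v for v in row] for t, row in enumerate(grid)]
```

A robustness experiment of the kind described would apply one of these masks to the training data, refit the model, and compare the resulting R² with the value obtained on complete data.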
