In ELIXIR-EXCELERATE Task 6.3, one of the main purposes is the evaluation of tools and pipelines to establish "gold standards" for metagenomics analysis pipelines.

To evaluate the accuracy and performance of metagenomic analysis tools and pipelines, it is essential to have a realistic set of data that can be used as a reference to assess them. We created a set of metagenomes with a high number of taxa, sequencing errors, and unknown reads. It can be used to evaluate a wide range of aspects of metagenomics analysis tools and pipelines.

Currently, six different sets of semi-synthetic marine metagenomes have been created using genomic data from marine organisms with a full genome published in ENA. We have also developed a pipeline to produce these datasets and plan to implement it in CWL. The created metagenomes contain genomic data from eukaryotic, prokaryotic and viral organisms, selected and mixed to simulate a real marine metagenome. Besides full genomic sequences, annotated sequences were also inserted in the metagenomes to test the functional analysis performance of the tools. To simulate the complexity of a real dataset, an error profile was generated from a real marine metagenome and used in the simulation of the reads. In addition, a large number of shuffled reads were simulated and inserted in the metagenomes.
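As an illustration of the shuffled-read step, the following is a minimal sketch in Python. It assumes that a shuffled read is produced by randomly permuting the bases of an existing read, so that base composition is preserved but the sequence no longer matches any known genome. The abstract does not state how the shuffled reads were actually generated, and the function names `shuffle_read` and `add_shuffled_reads`, the `fraction` parameter, and the example reads are hypothetical.

```python
import random


def shuffle_read(seq, rng):
    """Return a random permutation of the bases in seq (assumed shuffling scheme)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def add_shuffled_reads(reads, fraction, seed=0):
    """Return the input reads followed by shuffled copies of a random subset.

    The number of shuffled reads is int(len(reads) * fraction).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    n_shuffled = int(len(reads) * fraction)
    picked = rng.sample(reads, n_shuffled)
    shuffled = [shuffle_read(read, rng) for read in picked]
    return reads + shuffled


# Hypothetical toy reads, for illustration only.
reads = ["ACGTACGTAA", "GGGCCCATAT", "TTTTACGCGC", "ATATATGCGC"]
mixed = add_shuffled_reads(reads, fraction=0.5)
```

Because each shuffled read keeps the base composition of its source read, such reads can stand in for "unknown" sequences that a tool should leave unclassified.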

These semi-synthetic metagenomes are used to test the two metagenomics pipelines that are part of the ELIXIR project: META-pipe (ELIXIR-NO) and EBI Metagenomics Portal (EMBL-ELIXIR). Furthermore, the creation of an environment-specific dataset and a pure RNA dataset is planned, as well as a better integration of the MarDB database, created by the ELIXIR-NO node, into the developed workflow. The final datasets are the starting point for evaluating existing and future shotgun metagenomic tools and pipelines.

Friday, September 1, 2017