Reverse engineering the transcriptome via Machine Learning methods
Genes and their corresponding pathways form networks that regulate cellular functions critical to tumor development. These networks, termed Gene Regulatory Networks (GRNs), define the regulatory relationships among genes and provide a concise representation of the transcriptional regulatory landscape of the cell. Different phenotypes can activate different functional pathways through global rewiring of the underlying GRNs. Moreover, single-cell RNA sequencing (scRNA-Seq) technologies now make it possible to acquire gene expression data for each individual cell in samples containing up to millions of cells. These cells can be grouped into states along an inferred cell differentiation path, and each state is potentially characterized by a similar but distinct gene regulatory network. Hence, it would be desirable for scRNA-Seq GRN inference methods to capture the GRN dynamics across cell states. However, current GRN inference methods produce a single GRN per input dataset (or independent GRNs per cell state) and therefore fail to capture these regulatory dynamics.
We are currently developing several methods to uncover the underlying transcriptional rewiring from both bulk and single-cell RNA-Seq data, and applying them to several diseases. We envision that these methods will provide novel insights into the key transcriptional rewiring associated with several cancer malignancies, strengthened by our ability to follow up in the clinic.
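To make the idea of state-specific rewiring concrete, here is a minimal toy sketch. It is not one of the methods under development: it uses thresholded Pearson co-expression as a crude stand-in for GRN inference, and the gene names (`TF`, `G1`, `G2`), the simulated data, and the 0.8 threshold are all assumptions made for illustration. It builds one network per simulated cell state and reports the edges that differ between them.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(expr, threshold=0.8):
    """Return gene pairs whose absolute correlation reaches the threshold.

    expr maps a gene name to its expression values across the cells of one
    state. Co-expression is only a proxy for regulation (an assumption here).
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def simulate_state(rng, n_cells, regulated_gene):
    """Simulate one cell state in which TF drives exactly one target gene."""
    tf = [rng.gauss(0, 1) for _ in range(n_cells)]
    expr = {"TF": tf}
    for gene in ("G1", "G2"):
        if gene == regulated_gene:
            # Target follows the regulator with a little noise.
            expr[gene] = [v + rng.gauss(0, 0.1) for v in tf]
        else:
            # Unregulated gene varies independently of the regulator.
            expr[gene] = [rng.gauss(0, 1) for _ in range(n_cells)]
    return expr


rng = random.Random(0)  # fixed seed so the toy data is reproducible
edges_a = coexpression_edges(simulate_state(rng, 200, "G1"))
edges_b = coexpression_edges(simulate_state(rng, 200, "G2"))
rewired = edges_a ^ edges_b  # edges present in exactly one state

print("state A edges:", sorted(edges_a))
print("state B edges:", sorted(edges_b))
print("rewired edges:", sorted(rewired))
```

Because each state's network is inferred independently, the comparison only happens after the fact, which is exactly the limitation described above: nothing in the inference itself models how one state's network relates to the next.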
Standards-based compression and encryption of genomic data for cancer research
Genomic medicine is an emerging field in cancer that is evolving rapidly in terms of technology and the scale of data generation, organization, and analysis. Currently, next-generation sequencing (NGS) technologies, such as whole genome sequencing (WGS), whole exome sequencing (WES), and transcriptome sequencing, are used mainly in research settings; using omics data in clinical care is still not common practice. At present, the majority of cancer patients receive treatments that are minimally informed by omics data. Electronic medical records consist of patient-centric data (such as cancer diagnoses, medical history, demographics, imaging, vital signs, lab results, and medications), but cancer genomic data is not stored in or queried through them. As the price of sequencing falls, the majority of cancer patients will be sequenced in the clinic for diagnosis and treatment. As more samples are sequenced, data management needs will grow: large volumes of data must be compressed, stored, and accessed securely and efficiently.
Given the pressing need to address the growth of genomic data, several specialized compressors for genomic data have been proposed in the last 10 years, some achieving outstanding compression gains. Among those, the CRAM format has seen the widest adoption. However, data in the original human-readable formats (FASTQ and SAM files) are still largely compressed with inefficient general-purpose schemes such as gzip. The main challenges for the adoption of efficient specialized formats are i) the lack of guarantees for long-term support, ii) technical limitations for efficient fine-grained selective access on the compressed data, and iii) the lack of support for integrated annotation and encryption of compressed genomic information. Therefore, there is an urgent need for new formats for the representation of genomic information that address these limitations, and for informatics tools that draw on these formats to provide the field of cancer genomics with a unified framework for effective and secure data storage and retrieval.
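As a toy illustration of why a general-purpose scheme can fall short of a domain-aware one, the sketch below compares gzip with a simple 2-bit-per-base packing of a nucleotide string. This is neither CRAM nor the format under development; the helper names (`pack_2bit`, `unpack_2bit`) are hypothetical, the input is a simulated uniformly random sequence rather than real reads, and quality scores, read names, and ambiguous bases such as N are ignored.

```python
import gzip
import random

BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a final partial chunk so decoding order is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n_bases):
    """Invert pack_2bit; n_bases trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


rng = random.Random(0)  # fixed seed for a reproducible toy sequence
seq = "".join(rng.choice(BASES) for _ in range(10_000))

packed = pack_2bit(seq)
gzipped = gzip.compress(seq.encode("ascii"))

print("raw bytes:    ", len(seq))
print("gzip bytes:   ", len(gzipped))
print("2-bit bytes:  ", len(packed))
print("round trip ok:", unpack_2bit(packed, len(seq)) == seq)
```

Real specialized compressors go much further than this, for example by modeling the statistics of reads and quality values, but even this fixed encoding uses the knowledge that the alphabet has four symbols, which a byte-oriented compressor has to rediscover from the data.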
We are developing a new format for genomic information representation that builds upon the current efforts by the International Organization for Standardization (ISO) to produce a set of specifications for a standards-based representation of these data. We are also building the informatics tools that operate on the newly proposed format to provide the framework needed for efficient and secure handling of genomic data. The proposed framework has the potential to accelerate the field of cancer genomics: it aids (re-)analysis of genomic data by allowing efficient selective access on the compressed data, reduces the storage and transmission burden, enables secure and efficient sharing of re-analyzed data, and supports annotating newly discovered copy number variations (CNVs) directly on the compressed data, among other capabilities.
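Selective access on compressed data can be sketched in a few lines. The example below is only an assumption-laden illustration of the general idea, not the proposed format or the ISO specification: records are compressed in independent fixed-size blocks with zlib, so retrieving one record requires decompressing a single block rather than the whole file. The function names, the block size, and the tab-separated toy records are hypothetical.

```python
import zlib


def compress_blocks(records, block_size):
    """Compress records in independent blocks of block_size records each."""
    blocks = []
    for start in range(0, len(records), block_size):
        payload = "\n".join(records[start:start + block_size])
        blocks.append(zlib.compress(payload.encode("utf-8")))
    return blocks


def fetch_record(blocks, block_size, index):
    """Return record number `index`, decompressing only the block holding it."""
    block = zlib.decompress(blocks[index // block_size]).decode("utf-8")
    return block.split("\n")[index % block_size]


# Toy records standing in for aligned reads (name, position, sequence).
records = [f"read{i}\t{1000 + 7 * i}\tACGTACGT" for i in range(1000)]
blocks = compress_blocks(records, block_size=100)

print("blocks:", len(blocks))
print("record 257:", fetch_record(blocks, 100, 257))
```

The block size is the main design trade-off in a scheme like this: smaller blocks give finer-grained access but leave the compressor less context per block, which usually costs compression ratio.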
"Uncovering biomarkers via machine learning methods to improve diagnosis, prevention and treatment of diseases in the context of personalized medicine",
Dr. Mikel Hernáez, Program Director.