Gene Research in the Era of Big Data

[China Pharmaceutical Network Technology News] Driven by new-generation sequencers and rapidly evolving analysis platforms such as high-performance computing clusters, the field of genetic research is being flooded with massive data. Many genomics, cancer, and medical research institutions and pharmaceutical companies can no longer properly process and store the enormous volumes of data they generate, let alone move them over conventional communication lines, even though this data must be quickly stored, analyzed, shared, and archived to meet the needs of genetic research. Some have had to resort to shipping disk drives to external computing centers, which creates a huge barrier to fast data access and analysis. Equally important as scale and speed is intelligence: all genomic information should be linked through data models and categories and labeled in machine- and human-readable form, so that genomic, clinical, and environmental data can be brought together on a common analysis platform.

Overview

The genomic medical revolution with both opportunities and challenges

Since the launch of the Human Genome Project, a succession of projects has gradually begun to reveal the associations between the human genome and disease. As sequencing technology continues to advance, a human genome can now be sequenced for only $1,000.

Figure 1 Decade of advances in genomic medical technology

The Human Genome Project was the first scientific research project to determine the human genome sequence. The project, which lasted 13 years and cost nearly $3 billion, was completed in 2003 and remains by far the largest collaborative biology project. Since then, a series of technological advances have emerged in DNA sequencing and large-scale genomic data analysis, and the time and cost of sequencing a single human genome have fallen dramatically, even faster than Moore's Law would predict.

Figure 2 Rapid decline in DNA sequencing costs

(Since 2001, the National Human Genome Research Institute (NHGRI) has tracked the DNA sequencing performed at sequencing centers funded by the National Institutes of Health (NIH) and the associated costs, providing a benchmark for important improvements in DNA sequencing. The graph shows significant improvements in DNA sequencing technology and data generation pipelines in recent years. Source: NHGRI, http://)

As an example of advances in sequencing technology, Illumina released the HiSeq X10 next-generation sequencer in 2014, capable of sequencing 18,000 human genomes a year at a cost of only $1,000 per genome. This so-called "thousand-dollar genome" technology makes human whole-genome sequencing cheaper than ever and is expected to have a huge impact on the healthcare and life sciences industries.

The success of new technologies and research methods has also come at a considerable cost, and massive data has become an urgent problem to solve:

Genomic data has doubled roughly every five months over the past eight years. Genome annotation projects have assigned clear function to about 80% of the genome, which makes obtaining whole-genome sequences especially important. Cancer genome studies reveal a diverse set of genetic variants in cancer cells that are generated and tracked by whole-genome sequencing, yielding approximately 1 terabyte of data per analysis. More and more countries, including the United States, the United Kingdom, China, and Qatar, have launched genome sequencing projects, which will generate hundreds of petabytes of sequencing data.

Requirements for end-to-end architecture

To meet the demanding requirements of speed, scale, and intelligence in genomic medicine research, an end-to-end reference architecture is needed that covers the key functions of genomic computing, such as data management (the data hub), load orchestration (the load orchestrator), and enterprise access (the application center). To determine the content and priority of the reference architecture (functions and capabilities) and the solutions mapped to it (hardware and software), there are three main principles to follow:

Software defined: a software-based abstraction layer for computing, storage, and cloud services defines the infrastructure and deployment model, so that the genomic infrastructure can grow and expand as data volumes and computational loads accumulate.
Data centric: manage the explosive growth of genomic research, imaging, and clinical data with strong data management capabilities.
Application ready: integrate multiple applications into a consistent environment, providing data management, version control, load management, workflow orchestration, and unified access and monitoring.

Figure 3 Example of genomic research reference architecture

The blue color in the figure indicates the genomic research platform, the green indicates the translational research platform, and the purple indicates the personalized medicine platform. These three platforms share enterprise-level functions: a data hub for data management, an orchestrator for load management, and an application center for access management.

Architecture deployment master plan

The architecture needs to be deployed with a variety of infrastructure and information technologies. Below are some deployment models and examples of technologies, solutions, and products that are mapped to data hubs, load orchestrators, and application centers.

Figure 4 Reference architecture deployment model

As shown in the figure, basic storage technologies (solid-state drives, flash memory, conventional hard disks, cloud), computing technologies (high-performance computing, big data, Spark, OpenStack, Docker), and user-access technologies (application workflows, file protocols, database queries, visualization, and monitoring) are managed by the three enterprise functions: the data hub, the load orchestrator, and the application center.

Many solutions and products can be applied to this model as a deployable platform for genomic research, translational research, and personalized medicine, such as the open-source solution Galaxy and the IBM Spectrum Scale solution (GPFS).

Building on the reference architecture

Another requirement for an end-to-end reference architecture is to grow the platform and infrastructure by integrating various new and old building blocks that can be mapped to different requirements. These building blocks can be of different types, patterns, sizes, and system architectures, such as stand-alone servers, cloud virtual machines, high-performance computing clusters, low-latency networks, scale-out storage systems, big data clusters, tape archives, metadata management systems, and more. For building blocks to be integrated into the architecture, industry-standard data formats, common software frameworks, and hardware interoperability standards are required, so that the genomic infrastructure can be implemented and extended in a variety of flexible ways:

Start small: because the architecture is software defined, systems, platforms, and infrastructure can start quite small to fit a limited budget, as long as the key functions and capabilities are in place. For example, a clinical sequencing lab can deploy a small system of only one or two servers with a modest amount of disk storage and the critical management software.

Rapid growth: as computing and storage demands grow, the existing infrastructure can scale quickly to a very large size without disrupting operations. At the end of 2013, the Sidra Medical and Research Center established its own genomic research infrastructure and subsequently added a new building block (a 60-node high-performance computing cluster) through the reference architecture; by mid-2014 its storage infrastructure had tripled. This capacity for growth has made Sidra an infrastructure provider for the Qatar genome project.

Geographical distribution: data sharing and federation is a relatively new capability in the high-performance computing space. Data and computing resources are deployed in different locations while remaining accessible to users, applications, and workflows. In the reference architecture, the data hub and the load orchestrator are closely involved in this.

Many of the world's leading healthcare and life sciences organizations are actively exploring such architectures to support their integrated research computing infrastructure. The following sections describe key components, best practices, and project experience for such a reference architecture.

Data hub

Data management is the most fundamental capability of the genomics research platform, because massive amounts of data need to be processed at the right time and place at the right cost. In terms of time, data may be analyzed within a few hours on a high-performance computing system, or recalled from an archive for re-analysis years later. In terms of place, data may reside on local infrastructure, in near-line storage, or in remote physical or cloud storage.

Data management challenge

The four Vs of big data are precisely the challenges of genomic data management: very large data streams and volumes (Volume), demanding I/O speed and throughput requirements (Velocity), fast-evolving data types and analysis methods (Variety), and the need to share and explore large amounts of data with confidence in a reliable environment (Veracity). In addition, regulation (patient data privacy and protection), provenance management (full version control and audit trails), and workflow orchestration impose further requirements, making data management even more difficult.

Data volume

Genomic data is growing constantly because of the sharp decline in sequencing costs. For AMRC, an academic medical research center equipped with next-generation sequencing technology, doubling its data storage capacity every 6 to 12 months has become commonplace. A cutting-edge research institution in New York, AMRC started 2013 with 300 terabytes of data storage capacity; by the end of 2013, its storage had surged past 1 PB (1,000 TB), three times what it had been 12 months earlier. Even more striking, this growth is still accelerating and continues to this day. For some of the world's leading genomic medicine projects, such as the UK's genome project, the genome projects of Saudi Arabia and Qatar, the US million-person genome initiative, and the National Gene Bank of China, the starting point or baseline for data volume is no longer measured in terabytes (TB) but in hundreds of petabytes (PB).

Data access speed

The genomic platform has very demanding data access speed requirements, for three reasons:

Very large files: in genomic research, large files are typically used to store the genomic information of the subjects, which can be a single patient or a group of patients. There are two main types: binary alignment/map files (BAM, generated by genomic sequence alignment) and variant call files (VCF, containing the genetic variants obtained after processing). These files are often larger than 1 TB and can account for half of the storage in a typical genomic data warehouse. In addition, expanding the scope of a study and sequencing at higher coverage (for example, 30x to 100x genome coverage) yields more genomic information and causes the stored files to grow rapidly. As genomic research evolves from studies of rare variants (single-patient variants) to studies of common variants, a new need arises: extracting samples from thousands of patients. Take an estimate provided by the Broad Institute as an example: for 57,000 shared samples, the BAM input files total 1.4 PB and the VCF output files total 2.35 TB, both massive by current standards but likely to become very common in the near future.

Many small files: these files store raw or temporary genomic information, such as sequencer output (for example, Illumina's BCL format). They are typically smaller than 64 KB and can account for more than half of the files in a typical genomic data warehouse. Unlike large files, each small file's I/O requires separate operations on data and metadata, so the load of generating and accessing huge numbers of small files can be enormous; measured in I/O operations per second (IOPS), the demand on the underlying storage system can reach millions of IOPS. For example, the AMRC infrastructure in San Diego had not been optimized for small-file workloads such as BCL conversion (for example, Illumina's CASAVA software); the limited I/O capacity of the infrastructure (especially its IOPS) throttled the load, starving the computing resources and eventually bringing the system to a standstill. Benchmarks confirmed that CPU efficiency dropped to single digits as computing power was wasted waiting for data to arrive. To alleviate this computational bottleneck, data caching is required to move I/O operations from disk to memory.

Parallel and workflow operations: to improve performance and reduce turnaround time, genomic computations are usually run in batches within well-organized workflows. From small-scale targeted sequencing to large-scale whole-genome sequencing, parallel operation is indispensable if the load is to run at higher speed. As hundreds of different workloads run simultaneously in a parallel computing environment, the demands on storage, measured in I/O bandwidth and IOPS, keep accumulating and can explode. AMRC's bioinformatics applications run concurrently on 2,500 computing cores, creating millions of data objects at a rate of one file per second, whether organized as 2,500 directories with 2,500 files each or as 14 million files in a single directory, all of which must be processed in a timely manner. And for a data warehouse with 600 million objects, 9 million directories, and in places only one file per directory, this is just a small part of its many workloads. With metadata at this scale, the IOPS load constrains overall performance: even a simple command that lists files (such as Linux's ls) can take minutes to complete, and parallel applications such as GATK pipelines suffer from the same low performance. At the beginning of 2014, the file system was significantly revised with a focus on improving the metadata infrastructure; bandwidth and IOPS performance improved substantially, and benchmarks showed that a genetic disease application ran 10 times faster without any application changes.
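To make the scale of this metadata pressure concrete, here is a back-of-the-envelope sketch in Python; the figures and the two-operations-per-file assumption are illustrative, not measurements from AMRC's systems.

```python
# Back-of-the-envelope sketch: estimate the sustained metadata IOPS a
# parallel genomics workload can generate. All figures are illustrative.

def metadata_iops(cores: int, files_per_sec_per_core: float,
                  metadata_ops_per_file: int = 2) -> float:
    """Each new small file costs metadata operations (inode, directory entry)
    in addition to the data write, so IOPS accumulate per core."""
    return cores * files_per_sec_per_core * metadata_ops_per_file

# 2,500 concurrent cores, each producing roughly one small file per second,
# with an assumed ~2 metadata operations per file.
print(metadata_iops(cores=2_500, files_per_sec_per_core=1.0))  # 5000.0 ops/s sustained

# Burstier stages (e.g. BCL conversion emitting many files at once) can be
# orders of magnitude higher, which is why the underlying storage may need
# to sustain millions of IOPS.
```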

Data diversity

Depending on how data is stored and accessed, there are many kinds of data, such as intermediate files generated by multi-step workflows, or output files and the reference data sets needed to produce them, which require careful version control. The current conventional approach is to store all data online or near-line in a single storage tier regardless of cost, which leads to a lack of big data lifecycle management. If the genomic data warehouse takes a long time to scan the file system, migration or backup cannot be completed in time. A large US genomics center struggled with how to manage fast-growing data after adopting Illumina's X10 whole-genome sequencing platform: it took four days to scan the entire file system, making backups at daily or even longer intervals impossible. As a result, data piles up rapidly in a single storage tier while metadata scanning performance keeps declining, a vicious circle for data management.

Another new challenge is managing data location. As inter-institution collaboration becomes more common, large amounts of data need to be shared or federated, making geographic location an indispensable attribute of the data. The same data set, especially reference data or output data, can have multiple copies in different geographic locations, or even multiple copies in the same location due to regulatory requirements (for example, copies created because clinical sequencing platforms must be physically isolated from research systems). In this situation, managing metadata effectively to reduce data movement and replication not only reduces the cost of additional storage but also reduces the problems associated with keeping versions synchronized.

Data confidence

Studying the multifactorial characteristics of many complex physical and mental disorders, such as diabetes, obesity, heart disease, Alzheimer's disease, and autism spectrum disorders, requires sophisticated computation and statistical analysis over large data streams (genomic, proteomic, imaging) and observation points (clinical, symptomatic, environmental, real-world evidence) drawn from a wide range of sources. Global data sharing and networking ensure that the ways data is accessed and analyzed keep growing in scale and dimension, and databases and file repositories evolve and interconnect accordingly. Under these conditions, data veracity becomes an indispensable element of research. For example, clinical data (genomic and imaging) must be properly and completely de-identified to protect the confidentiality of research subjects; genomic data requires end-to-end traceability to provide a complete audit trail and reproducibility; and data copyright and ownership must be properly declared in multi-party collaborations. With data veracity built in, genomic computing organizations can let researchers and data scientists share and explore large amounts of data with full context and confidence.

Data hub function

To solve the problems of genomic data management, an extensible, scalable layer is built to provide data and metadata to the workloads; this enterprise-level function can be called the data hub. It stores, moves, shares, and indexes massive amounts of raw and processed genomic data, and it manages the underlying heterogeneous storage, from solid-state drives and flash to disk, tape, and the cloud.

Figure 5 Data Hub Overview

As an enterprise-level function that provides data and metadata to all workloads, the data hub defines an extensible, scalable layer that virtualizes and globalizes all storage resources into a single global namespace, designed to provide four main functions:

High-performance data input and output (I/O)
Policy-driven information lifecycle management (ILM)
Efficient sharing through caching and necessary replication
Large-scale metadata management

For physical deployments, it supports more and more storage technologies as modular building blocks, such as:

Solid-state drive and flash storage systems
High-performance fast storage disks
High-capacity slow disks (4 TB per drive)
High-density, low-cost tape libraries
Local or globally distributed external storage caches
Hadoop-based big data storage
Cloud-based external storage

These four functions map onto the data hub as follows:

I/O management: large, scalable I/O calls for two capabilities. One is serving I/O bandwidth for large files such as BAM; the other is serving IOPS for large numbers of small files such as BCL and FASTQ. Because of these differing needs, traditional tiered architectures struggle to meet the performance and scale requirements. Data hub I/O management solves this problem by introducing the concept of pooling, separating the I/O operations on small files and metadata from the operations on large files. These storage pools, although mapped to different underlying hardware to deliver optimal storage performance, remain unified at the file-system level, providing a single global namespace for all data and metadata that is transparent to users.
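The following is a conceptual Python sketch of that pooling idea; the pool names, file-type rules, and thresholds are hypothetical, and a real data hub would express this as file-system placement policies rather than application code.

```python
# Conceptual sketch only (hypothetical pool names and thresholds): route a
# file the way a data hub's I/O management might, keeping metadata and
# small-file traffic on fast media and large sequential files on capacity media.
import os

SMALL_FILE_LIMIT = 64 * 1024      # 64 KB, per the small-file discussion above
LARGE_FILE_LIMIT = 1 * 1024**4    # ~1 TB, typical of large BAM files

def choose_pool(path: str) -> str:
    size = os.path.getsize(path)
    ext = os.path.splitext(path)[1].lower()
    if size <= SMALL_FILE_LIMIT or ext in {".bcl", ".json", ".log"}:
        return "ssd_metadata_pool"    # high-IOPS flash/SSD pool
    if ext in {".bam", ".fastq", ".vcf"} or size >= LARGE_FILE_LIMIT:
        return "capacity_data_pool"   # high-bandwidth disk pool
    return "default_pool"

# Example: a 2 KB BCL chunk would land on flash, a 300 GB BAM on the capacity pool.
```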

Lifecycle management: manages the entire lifecycle of data as it is created, retained, and deleted. Temperature is a useful metaphor for the stages and timeliness of data as it is captured, processed, migrated, and archived. Raw data captured by tools such as high-throughput sequencers is the hottest and requires high-performance computing clusters with robust I/O performance (so-called raw storage). After initial processing, the raw and processed data becomes merely warm, and a policy-based process determines its final disposition, such as deletion or retention in a long-term storage pool or archive. This process takes into account file type, size, usage information (such as when a user last accessed the file), and system utilization. Files that match a policy are either deleted or migrated from one storage pool to another, for example to a larger but slower and cheaper pool. The target tier can be a tape library, which makes efficient use of the underlying storage hardware and significantly reduces cost by combining storage pools with low-cost media such as tape.
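A minimal Python sketch of this temperature-driven policy is shown below; the thresholds and pool names are invented for illustration, and in practice the migration itself would be carried out by the storage layer's ILM engine rather than by application code.

```python
# Minimal sketch (hypothetical thresholds and pool names) of policy-based,
# "temperature"-driven tiering decisions.
import os
import time

DAY = 86400

def data_temperature(last_access, now=None):
    """Classify a file by how recently it was accessed."""
    age_days = ((now or time.time()) - last_access) / DAY
    if age_days < 7:
        return "hot"      # active analysis, keep on fast storage
    if age_days < 90:
        return "warm"     # candidate for capacity disk
    return "cold"         # candidate for tape or cloud archive

POOL_FOR = {"hot": "flash_pool", "warm": "capacity_pool", "cold": "tape_archive"}

def migration_plan(paths):
    """Yield (path, target_pool) pairs; the actual data movement is left to
    the storage layer and is not shown here."""
    for p in paths:
        yield p, POOL_FOR[data_temperature(os.path.getatime(p))]
```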

Shared management: addresses the need for data sharing within and across the logical domains of storage facilities. As genomic samples and reference data sets grow larger (in some cases a single workload can exceed 1 PB), moving and copying data for sharing and collaboration becomes harder. To minimize the impact of data replication on sharing, the data hub needs the following three capabilities, so that data sharing and movement can take place over private high-performance networks or wide-area networks with a high degree of security and fault tolerance:

Multi-cluster storage: a computing cluster is given direct access to remote systems and reads data on demand.
Cloud data cache: the metadata index and, selectively, the full data set of a given data warehouse (home) can be cached asynchronously on a remote (client) system for fast local access.
Federated database: enables secure association between distributed databases.

Metadata management: this function provides the foundation for the previous three. Storing, managing, and analyzing billions of data objects is a must for any data warehouse, especially one that exceeds the petabyte level, which is becoming the norm for genomic infrastructure. Metadata includes system metadata, such as file name, path, size, pool name, creation time, and modification or access time, as well as custom metadata in the form of key-value pairs that applications, workflows, or users can associate with files to achieve the following goals, as sketched after the list:

Place and move files based on size, type, or usage to facilitate I/O management.
Enable policy-based data lifecycle management, using lightning-fast metadata scans to gather the needed information.
Enable data caching, so that metadata can be distributed lightly with only weak dependence on the network.
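As a rough illustration of custom key-value metadata and fast index-based queries (the tag names and paths are hypothetical, and a real data hub keeps this index inside the storage system itself):

```python
# Illustrative sketch: key-value metadata attached to genomic files, and a
# "lightning scan" style query answered from the index instead of walking
# the file system.
from collections import defaultdict

class MetadataIndex:
    def __init__(self):
        self._tags = defaultdict(dict)    # path -> {key: value}
        self._by_tag = defaultdict(set)   # (key, value) -> {paths}

    def tag(self, path, **kv):
        self._tags[path].update(kv)
        for k, v in kv.items():
            self._by_tag[(k, v)].add(path)

    def query(self, **kv):
        """Return paths matching all key-value pairs, without touching disk."""
        sets = [self._by_tag[(k, v)] for k, v in kv.items()]
        return set.intersection(*sets) if sets else set()

idx = MetadataIndex()
idx.tag("/gpfs/proj1/sampleA.bam", sample="A", cohort="autism", pool="capacity_pool")
idx.tag("/gpfs/proj1/sampleA.g.vcf", sample="A", cohort="autism", pool="flash_pool")
print(idx.query(cohort="autism", pool="capacity_pool"))
```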

Data Hub Solutions and Application Cases

IBM Spectrum Scale offers high performance, extensibility, and scalability. Developed and optimized for high-performance parallel computing, Spectrum Scale serves high-bandwidth big data across all parallel compute nodes in a computing system. Given that genomic workflows can consist of hundreds of applications processing large numbers of files in parallel, this capability is critical for feeding data to genomic computing workflows.

Because genomic workflows generate large amounts of data and metadata, a file system whose system pool is built from high-IOPS solid-state drives and flash can dedicate that pool to storing the metadata of files and directories, and in some cases small files themselves. This greatly improves file-system performance and the responsiveness of metadata-heavy operations, such as listing all the files in a directory.

Because the file system also supports big data parallel computing, the data hub can serve parallel high-performance computing and big data jobs on the same compute nodes, eliminating the need for a separate Hadoop Distributed File System (HDFS).

Policy-based data lifecycle management allows the data hub to move data from one storage pool to another, maximizing I/O performance and storage efficiency while reducing operating costs. These storage pools range from high-I/O flash to high-capacity disk infrastructure to low-cost tape media integrated with tape management solutions.

The increasingly distributed nature of genomic research infrastructure also requires data management at a wider, even global, scale. Data needs to move and be shared not only across locations but also along with workloads and workflows. To achieve this, the data hub relies on Spectrum Scale Active File Management (AFM) for sharing. AFM extends the global namespace across multiple sites, allowing shared directories on a home site to be mapped to remote client sites as locally cached copies. For example, a genomic research center can own, operate, and version-control all reference databases or data sets, while affiliates, partner sites, or satellite centers access the reference data through this sharing function. When the central copy of a database is updated, the cached copies at other sites are updated quickly as well.

With a data hub, the system-wide metadata engine can also be used to index and search all genomic and clinical data, enabling powerful downstream analysis and translational research.

Load orchestrator

This section describes the challenges of genomic load orchestration and how orchestration tools help simplify load management.

Genomic load management challenges

Genomic load management is very complex. As genomic applications multiply, their maturity and programming models continue to diverge: many are single-threaded (such as R scripts) or embarrassingly parallel (such as BWA), while some are multi-threaded or MPI-enabled (such as mpiBLAST). What they have in common is that all of them need to run in a high-throughput, high-performance mode to produce final results.

Orchestration function

Orchestration tools orchestrate resources, workloads, and workflows. A load manager and a workflow engine link and coordinate a range of genomic computing and analysis jobs into fully automated workflows that are easy to build, customize, share, and run across platforms, and they provide the necessary application abstraction over underlying infrastructure such as high-performance computing clusters with GPUs, clouds, or big data clusters.

Figure 6 Overview of the load orchestrator
The orchestrator is an enterprise-level function for orchestrating resources, load, and traceability management, and it is designed around four main functions:

Resource management: dynamically and flexibly allocate computing resources on demand.
Load management: perform load management effectively by assigning jobs to different computing resources, such as local or remote clusters.
Workflow management: link applications together into logical, automated processes.
Traceability management: associate metadata records with workloads and workflows and preserve them.

Based on workflow logic and application requirements (such as architecture, CPU, memory, and I/O), the orchestrator maps and distributes the load onto elastic, heterogeneous resources (such as HPC, Hadoop, Spark, OpenStack/Docker, and cloud), defining an abstraction layer between the various computing infrastructures and the fast-growing array of genomic computations.
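The Python sketch below illustrates this kind of mapping; the pool definitions and job attributes are invented for illustration and are not the orchestrator's actual scheduling policy.

```python
# Conceptual sketch only: map a job's requirements onto heterogeneous
# resource pools before handing it to the load manager.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    model: str          # "mpi", "multithreaded", "serial", "spark"
    mem_gb: int
    needs_gpu: bool = False

POOLS = {
    "hpc_cluster":   {"models": {"mpi", "multithreaded", "serial"}, "max_mem_gb": 512, "gpu": True},
    "spark_cluster": {"models": {"spark"},                          "max_mem_gb": 256, "gpu": False},
    "cloud_burst":   {"models": {"serial", "multithreaded"},        "max_mem_gb": 128, "gpu": False},
}

def place(job: Job) -> str:
    for pool, spec in POOLS.items():
        if (job.model in spec["models"]
                and job.mem_gb <= spec["max_mem_gb"]
                and (not job.needs_gpu or spec["gpu"])):
            return pool
    return "queue_until_resources_free"

print(place(Job("bwa_mem_sampleA", model="multithreaded", mem_gb=64)))   # -> hpc_cluster
print(place(Job("variant_ml_training", model="spark", mem_gb=200)))      # -> spark_cluster
```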

Resource manager

This function allocates computing resources in a policy-driven manner to meet the computational needs of the genomic load. The most commonly used resource is a bare-metal high-performance computing (HPC) cluster. The resource manager provides resources that are allocated once or that can be dynamically converted and reallocated. If the data hub's I/O management provides the storage service layer, the resource manager can be thought of as providing the computing service layer. In addition, new infrastructure can be added to the resource pools, including big data Hadoop clusters, Spark clusters, OpenStack virtual machine clusters, and Docker clusters.

Converting resources based on load information is a basic requirement for the resource manager. For example, on a genomic infrastructure shared by a batch alignment job and a Spark machine-learning job, the load fluctuates at runtime, and the resource manager can shift resources between them based on observed utilization, supporting each job in the form of compute slots or containers.
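A toy Python sketch of such utilization-driven rebalancing between two workloads sharing one cluster follows; the numbers and the proportional-share rule are purely illustrative.

```python
# Toy sketch (illustrative numbers): split compute slots between a batch
# alignment queue and a Spark machine-learning job in proportion to demand.
def rebalance(total_slots: int, batch_queue_depth: int, spark_pending_tasks: int,
              min_share: int = 10):
    demand = batch_queue_depth + spark_pending_tasks
    if demand == 0:
        return total_slots // 2, total_slots - total_slots // 2
    batch = max(min_share, round(total_slots * batch_queue_depth / demand))
    batch = min(batch, total_slots - min_share)   # keep a minimum share for Spark
    return batch, total_slots - batch

# Nightly sequencing burst: the batch queue dominates, so it gets most slots.
print(rebalance(total_slots=200, batch_queue_depth=1800, spark_pending_tasks=200))
# -> (180, 20)
```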

Load manager

Under the control of the resource manager, genomic computing resources need to be shared and used effectively while delivering optimal performance to genomic applications. The load manager handles demanding, distributed, mission-critical applications such as Illumina's ISAAC, CASAVA, bcl2fastq, BWA, Samtools, SOAP (the Short Oligonucleotide Analysis Package), and GATK. The load manager also requires a high degree of scalability and reliability to manage large batches of submitted jobs, a common requirement for medium to large genomic computing organizations. For example, a genomic computing cluster at a medical school in New York routinely has 250,000 jobs in its queuing system, which must keep running without crashing or stalling; in some of the world's largest genomic centers, load manager queues sometimes hold millions of jobs. For genomic research applications of differing maturity and architectural requirements (CPU, GPU, large memory, MPI, and so on), the load manager provides the necessary resource abstraction so that jobs can be submitted, placed, monitored, and recorded transparently to users.

Workflow engine

For workflow management of genomic computations, the workflow engine connects jobs into a logical network. The network allows the computation to proceed linearly through multiple steps, such as sequence alignment, assembly, and then variant extraction, or to run with more complex branching based on user-defined criteria and completion conditions.

The orchestrator's workflow engine requires dynamic, fast, and sophisticated workflow-processing capabilities. Independent loads and jobs can be defined as standard workflow templates through the user interface, combined with variables, parameters, and data. Many kinds of workloads can be integrated into the workflow engine, such as parallel high-performance computing applications, big data applications, or R scripts for analytical workloads. Once a template is defined and verified, users can launch workflows from it directly at their workstations or publish it to the enterprise portal for others to use.

The workflow orchestration engine also needs to provide the following features:

Job arrays: maximize the throughput of genome sequencing analysis workflows. Suitable workloads can be split into many parallel jobs by means of a job array.

Sub-flows: multiple sub-flows can be defined to perform variant analysis in parallel after genome alignment. The results of each sub-flow can then be combined into a single output so that the analyst can compare results across tools.

Reusable modules: workflows can also be designed as modules that are embedded in larger workflows as dynamic building blocks. This not only makes workflows easier to build and reuse, but also helps users in large research institutions coordinate and share genomic workflows, as sketched below.
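The sketch below models a job array fanning samples out into parallel jobs and sub-flows whose variant-analysis results are gathered for comparison; the function bodies are placeholders rather than real applications.

```python
# Minimal sketch (placeholder functions) of a job array and parallel sub-flows.
from concurrent.futures import ThreadPoolExecutor

def align(sample: str) -> str:                    # placeholder for a BWA alignment job
    return f"{sample}.bam"

def call_variants(bam: str, tool: str) -> str:    # placeholder for one variant-analysis sub-flow
    return f"{bam}.{tool}.vcf"

def workflow(samples):
    with ThreadPoolExecutor() as pool:
        # "Job array": one alignment job per sample, run in parallel.
        bams = list(pool.map(align, samples))
        # "Sub-flows": several variant callers run in parallel per BAM,
        # then their outputs are gathered for side-by-side comparison.
        vcfs = [pool.submit(call_variants, bam, tool)
                for bam in bams for tool in ("toolA", "toolB")]
        return [f.result() for f in vcfs]

print(workflow(["sample1", "sample2"]))
```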

Figure 7 Genomic workflow integrated with the orchestrator

In the figure, there are the following components from left to right:

Box 1: data (such as a BCL file) automatically triggers CASAVA as the first step of the workflow.
Box 2: a dynamic sub-flow aligns sequences with BWA.
Box 3: Samtools post-processing runs as a job array.
Box 4: different variant-analysis sub-flows are triggered in parallel.

The genomic workflow combines applications and tools to process raw sequence data (BCL) into variant (VCF) data. Each box represents a workflow module consisting of a genomic application mapped to a function such as base calling (BCL conversion), sequence alignment, pre-processing, or variant extraction and analysis. These modules can themselves run as standalone workflows and can be connected into a larger workflow through logical and conditional relationships.
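As a hedged illustration of such a BCL-to-VCF chain, the Python sketch below drives standard open-source tools (BWA, Samtools, GATK) with minimal options; the reference and sample paths are hypothetical, and a production workflow engine would add resource requests, retries, and provenance capture around each step.

```python
# Hedged sketch of an alignment-to-variants chain driven from Python.
import subprocess

REF = "ref/genome.fa"            # hypothetical reference genome path

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def sample_to_vcf(sample: str) -> str:
    bam, sorted_bam, vcf = f"{sample}.bam", f"{sample}.sorted.bam", f"{sample}.vcf.gz"
    # 1. Sequence alignment with BWA-MEM, piped through samtools to produce a BAM.
    with open(bam, "wb") as out:
        p1 = subprocess.Popen(["bwa", "mem", REF, f"{sample}_R1.fastq.gz",
                               f"{sample}_R2.fastq.gz"], stdout=subprocess.PIPE)
        subprocess.run(["samtools", "view", "-b", "-"],
                       stdin=p1.stdout, stdout=out, check=True)
        p1.stdout.close()
        p1.wait()
    # 2. Post-processing: coordinate sort and index.
    run(["samtools", "sort", "-o", sorted_bam, bam])
    run(["samtools", "index", sorted_bam])
    # 3. Variant extraction with GATK HaplotypeCaller.
    run(["gatk", "HaplotypeCaller", "-R", REF, "-I", sorted_bam, "-O", vcf])
    return vcf
```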

As more organizations deploy hybrid cloud solutions with distributed resources, the orchestrator can balance the load based on data location, pre-defined policies and thresholds, and real-time resource availability. For example, a workflow can be designed to process raw genomic data close to the sequencer and then use the MapReduce model of a remote big data cluster for sequence alignment and assembly; or it can be designed so that when genomic processing reaches a 50% completion rate, a trigger event transfers the data from a satellite system to the central high-performance computing cluster, allowing data migration and computation to proceed concurrently and saving time and cost.

Publishing genomic workflows so that research institutions can share them with others is another requirement for the orchestrator. Because workflow templates can be saved and distributed, several major cancer and medical research institutions in the United States and Qatar have begun to collaborate by exchanging genomic workflows.

Traceability management

There are many computational methods and applications for collecting, analyzing, and annotating genomic sequences. The applications, reference data, and runtime variables are important traceability information that can strongly affect how a genomic analysis is interpreted and maintained. Currently, standards or conventions for capturing traceability information are rarely used, which can lead to the loss of important computational analysis data. The problem is compounded by other factors, such as complex data, workflows or pipelines with many analysis steps, and applications that release frequent updates.

Traceability management therefore becomes an important orchestrator function, comparable to the data hub's metadata management. Traceability data can be understood as load metadata. The traceability manager must capture, store, and index user-defined traceability data and be able to trace back any existing computational load or workflow in a transparent, non-disruptive manner.

Based on this demand, a variety of technologies and solutions are being developed, and some have been completed and put into commercial use, such as Lab7's ESP platform and General Atomics' Nirvana. IBM is also working on a large-scale, near real-time metadata management system that works with data hubs and orchestrators.

Application center

Overview

The application center is the user interface for accessing data hubs and load orchestrators. It provides an enterprise portal based on role access and security controls, making it easy for researchers, data scientists, and clinicians to access data, tools, applications, and workflows. Its goal is to enable researchers and data scientists without computer programming experience to use complex genomic research platforms.

The application center is reusable and can also serve as part of a translational and personalized genomic medicine platform.

Figure 8 Application Center Overview

The diagram depicts starting and monitoring the load, querying and browsing data, visualizing the output, and tracking system logs and usage information. It defines the abstraction layer between users (researchers, doctors, and analysts) and data hubs and load orchestrators.

Application center requirements

The requirements for the application center include the following two points:

Portal-based catalog functionality: access and visualize applications, workflows, and data sets.
Monitoring capabilities: monitor, track, report on, and manage application-specific information.

Portal-based catalog function

Data scientists want intuitive access to genomic workflows and data sets, yet genomic analysis is often extremely complex. To minimize the barrier between the two, the application center provides a catalog of pre-built, pre-verified application templates and workflow definitions that let users launch jobs or workflows directly from the portal.

Figure 9 Application Center Genome Workflow

The figure shows an end-to-end genomic workflow (BWA-GATK) launched and visualized through the application center portal, reading from the left:

Box 1: the workflow is automatically triggered when data arrives.
Box 2: a dynamic sub-flow performs sequence alignment with BWA.
Box 3: Samtools post-processing runs as a job array.
Box 4: BAM file recalibration.
Box 5: GATK performs variant extraction.

The application center catalog can be configured with a cloud data browser to manage the data needed for genomic computation. In the portal-based browser, users can find genomic data by browsing and searching the files and directories of all remote or local storage servers (data hubs), and a file can be attached to a job launch no matter where it resides. Using the data browser, users can also tag files and directories to find them quickly; for example, a directory tagged for genomic computing users can hold frequently accessed reference data sets.

Finally, the data browser also simplifies data transfer: users can drag and drop files from their desktop into the current remote directory in the browser to upload multiple files at once.

Real-time monitoring

Application center monitoring also needs to provide a portal-based dashboard with comprehensive load monitoring, reporting, and management capabilities. As a monitoring tool, it should not focus solely on system monitoring but should offer a complete, integrated load-monitoring facility. Across the diverse configurations of genomic applications (such as large-memory, parallel, or single-threaded), it tracks and aggregates the CPU, memory, and storage I/O usage associated with each job and application, helping to improve application efficiency.
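A small Python sketch of this kind of per-job tracking follows; it assumes the psutil package and hypothetical process IDs, and a real application center would collect these figures through the load manager rather than by polling processes directly.

```python
# Small sketch (assumes psutil and hypothetical job PIDs): sample CPU, memory,
# and I/O for the processes belonging to one job and aggregate them for a dashboard.
import psutil

def job_usage(pids):
    """Aggregate CPU %, resident memory, and I/O bytes for one job's processes."""
    cpu = mem = read = written = 0
    for pid in pids:
        try:
            p = psutil.Process(pid)
            cpu += p.cpu_percent(interval=0.1)
            mem += p.memory_info().rss
            io = p.io_counters()
            read, written = read + io.read_bytes, written + io.write_bytes
        except psutil.NoSuchProcess:
            continue   # the job step may already have finished
    return {"cpu_percent": cpu, "mem_gb": mem / 1024**3,
            "read_gb": read / 1024**3, "written_gb": written / 1024**3}

# e.g. job "bwa_sampleA" running as PIDs 4321 and 4322 (illustrative):
# print(job_usage([4321, 4322]))
```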

Conclusion

To meet the demanding requirements of genomic research for speed, scale, and intelligence, and to serve the technical professionals responsible for creating and delivering life-science solutions (scientists, consultants, IT architects, IT specialists, and others), the end-to-end reference architecture emerging in this field is being deployed, together with a variety of infrastructure and information technologies, in a growing number of research institutions, and the ecosystem of customers and partners built on this architecture keeps growing, steadily enriching the corresponding solutions and products. As the technology develops, genomic medicine is expected to fundamentally change biomedical research and clinical care. Studying human genes together with biological pathways, drug interaction mechanisms, and environmental factors will allow genomic scientists and clinicians to identify populations at high risk of disease, provide them with early diagnoses based on biochemical markers, and recommend effective treatments.

About the author: Xian Wei () joined IBM in 2011 and has worked in software research and development ever since, focusing on automated workflow management and high-performance computing.
