First joint meeting between the Institute of Statistical Science, Academia Sinica, Taiwan (ISSAS) and the Institute of Statistical Mathematics (ISM)

Dates: 29 (Thu.), 30 (Fri.) November 2007
Venue: Auditorium, The Institute of Statistical Mathematics Tokyo, Japan

Program: 29 November 2007

9:50 - 10:00 Opening ceremony
Greeting of Director-General of ISM
Greeting of Director of ISSAS

10:00 - 10:45 Ker-Chau Li (ISSAS, Director)
Finding disease candidate genes by liquid association

10:45 - 11:30 Shinto Eguchi (ISM)
Boosting learning approach to association studies in bioinformatics

11:30 - 12:15 Hsin-Chou Yang (ISSAS)
KBAT: Kernel-based association test

12:15 - 13:30 Lunch

13:30 - 14:15 Satoshi Kuriki (ISM)
Multiplicity adjustments in detecting reproductive barriers caused by loci interactions

14:15 - 15:00 Chen-Hsin Chen (ISSAS)
A Statistical Platform for Microarray Gene Expression Experiments and Some Studies in Genomic Statistics

15:00 - (15:30) Oral introductions of posters (Each 5 minutes)

Tomoyuki Higuchi (ISM)
Biological Information fusion for estimating gene networks via Bayesian modeling and computation

Grace Shieh (ISSAS)
A Pattern Recognition Approach to Infer Genetic Networks

Haque Mollah (ISM)
QTL Analysis With the Boosting Features for the Shape Variation of F2-Mice

Lung-An Li (ISSAS)
Analyzing data from Flow Cytometry Experiments

Rui Yamaguchi (ISM Collaborator)
Detecting Activated Transcription Factors from Gene Expression Data of Mice Treated with Kampo Medicine by Statistical Absolute Evaluation Method

(15:30) - 17:00 Poster presentation and free discussion

18:00 - Party

Program: 30 November 2007

10:00 - 10:45 Chun-Houh Chen (ISSAS)
Matrix Visualization and Information Mining for High-Dimensional Bio-Medical Data Structure

10:45 - 11:30 Ryo Yoshida (ISM)
Genomic Data Assimilation for Inferring Gene Regulatory Networks from Gene Expression Profiles

11:30 - 12:15 Wei-Chung Liu (ISSAS)
On the topological importance of enzymes and their phylogenetic frequency

12:15 - 13:30 Lunch

13:30 - 14:15 Kenji Fukumizu (ISM)
Dimension reduction with positive definite kernels

14:15 - 15:00 Chen-Hung Kao (ISSAS)
Statistical Methods for Quantitative Trait Loci Mapping

15:00 - (15:30) Oral introductions of posters (Each 5 minutes)

Hironori Fujisawa (ISM)
A unified method for detecting single feature polymorphisms and gene expression level differences

Shinsheng Yuan (ISSAS)
Context-dependent Clustering for Dynamic Cellular State Modeling of Microarray Gene Expression

Osamu Komori (ISM)
Flexible Combinations of Covariates by Boosting the Area under the ROC Curve

Hsuan-Yu Chen (ISSAS)
A Five-Gene Signature and Clinical Outcome in Non-Small-Cell Lung Cancer

Mari Pritchard (ISM)
Boosting ordinary class labels in statistical pattern recognition

Osamu Hirose (University of Tokyo)
Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models

(15:30) - 17:00 Poster presentation and free discussion

17:00 Closing ceremony

Abstracts:

Finding disease candidate genes by liquid association.

Ker-Chau Li (ISSAS).

The fast-growing public repertoire of microarray gene expression databases provides individual investigators with unprecedented opportunities to study transcriptional activities for genes of their research interest at no additional cost. Here we demonstrate that such open resources hold the promise of benefiting numerous projects aiming at resolving detailed genetic profiles predisposing to complex diseases and their trait components. Starting from MBP (myelin basic protein), a major gene associated with the underlying molecular pathology of multiple sclerosis (MS), and PRKCA (protein kinase C), a susceptibility gene recently identified by fine mapping the MS locus on 17q22-q24, we conducted a genome-wide study on four large-scale gene expression databases, using an on-line computation system which features the novel method of liquid association. A string of findings points to the gene SLC1A3 (glial high glutamate transporter 3), portraying a coherent web of molecular evidence that is consistent with the glutamate-induced excitotoxicity hypothesis about demyelination and axonal damage in MS. Previously unknown dynamic patterns of transcript coregulation between the consensus HLA locus and three other major MS loci identified in the earlier Finnish scans are also revealed. We further validate the connection of SLC1A3 to MS by genotyping intragenic single-nucleotide polymorphisms (SNPs) in MS families from the high-risk population of Finland. Our biocomputing-driven approach to complex trait study complements both the traditional candidate-gene approach, which is confined by existing biomedical knowledge, and the "hypothesis-free" full-genome scans, which often lead to a plurality of wide loci and leave few clues on their functional relevance to the molecular basis of the pathogenesis.

KBAT: Kernel-based association test.

Hsin-Chou Yang (ISSAS).

Association mapping (i.e., linkage disequilibrium mapping) is a powerful tool for positional cloning of disease genes. We propose a kernel-based association test (KBAT), which is a composite function of "p-values of single-locus association tests" and "kernel weights related to intermarker distances and/or linkage disequilibria". KBAT is a general form of many current statistical tests. This method can be applied to the study of candidate genes and can scan whole chromosomes using a sliding window procedure. We evaluated the performance of KBAT through comprehensive simulation studies that considered evolutionary parameters, disease models, sample sizes, kernel functions, test statistics, window attributes, and genetic/physical maps. The results of 1,000 simulation replicates for each condition showed that KBAT had high test power and a well-controlled type I error compared to existing methods. KBAT was also applied to a genome-wide data set on alcohol dependence. In summary, the strengths of KBAT are multi-fold: KBAT is robust against the inclusion of nuisance markers, is invariant to the map scale, and accommodates different types of genomic data, study designs, and study purposes. The proposed methods are packaged in the user-friendly software, KBAT. This is joint work with Hsin-Yi Hsieh and Cathy SJ Fann.
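
The abstract does not give the KBAT statistic itself, so the following is only a minimal sketch of the general idea of combining single-locus p-values with distance-based kernel weights in a sliding window. The Gaussian kernel, the use of -log10(p), the function names, and the toy data are all assumptions made for illustration and are not taken from the KBAT software.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Weight that decays with intermarker distance (assumed kernel shape)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def kernel_weighted_score(p_values, positions, center, bandwidth):
    """Kernel-weighted average of -log10(p) for markers around `center`."""
    weights = [gaussian_kernel(pos - center, bandwidth) for pos in positions]
    weighted = sum(w * -math.log10(p) for w, p in zip(weights, p_values))
    return weighted / sum(weights)

def sliding_window_scan(p_values, positions, bandwidth):
    """Evaluate the score with the window centered on each marker in turn."""
    return [kernel_weighted_score(p_values, positions, c, bandwidth)
            for c in positions]

# Hypothetical single-locus p-values at five equally spaced markers.
p_values = [0.5, 0.04, 0.001, 0.03, 0.6]
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = sliding_window_scan(p_values, positions, bandwidth=1.0)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy scan the highest score falls on the middle marker, which has the smallest p-value and supportive neighbors. A real analysis would still need a null distribution for the score, for example by permutation, which this sketch does not attempt.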

Multiplicity adjustments in detecting reproductive barriers caused by loci interactions

Satoshi Kuriki (ISM),
Yoshiaki Harushima (National Institute of Genetics),
Hironori Fujisawa (ISM),
Nori Kurata (National Institute of Genetics).

The reproductive barrier is a genetic mechanism to isolate species. We try to detect a type of reproductive barrier caused by loci interactions. We calculate chi-square test statistics from two-way tables cross-classified by genotypes of all pairs of dense markers over all chromosomes, and detect the interactions. Since the number of tests becomes huge, we need to consider the multiplicity of the tests. By assuming a standard model for the linkage, the chi-square test statistics are shown to be regarded as a chi-square random field with a direct-product covariance structure. Then, the adjusted p-value can be obtained by evaluating the distribution of the maximum of the random field. We give two methods for this: one is a Monte Carlo simulation based on a spatial AR model, and the other is an analytic method based on nonlinear renewal theory.
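
As a rough illustration of the first step described above, the sketch below computes the Pearson chi-square statistic from the two-way genotype table of every marker pair and reports the maximum. The genotype coding (0, 1, 2), the function names, and the toy data are assumptions; the multiplicity adjustment via the chi-square random field is not reproduced here.

```python
from itertools import combinations

def genotype_table(marker_a, marker_b, genotypes=(0, 1, 2)):
    """Cross-classify individuals by their genotypes at two markers."""
    index = {g: k for k, g in enumerate(genotypes)}
    table = [[0] * len(genotypes) for _ in genotypes]
    for a, b in zip(marker_a, marker_b):
        table[index[a]][index[b]] += 1
    return table

def chi_square_statistic(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells with an empty margin
                stat += (observed - expected) ** 2 / expected
    return stat

def max_pairwise_statistic(markers):
    """Return (largest statistic, marker pair) over all marker pairs."""
    best = (0.0, None)
    for (name_a, a), (name_b, b) in combinations(markers.items(), 2):
        stat = chi_square_statistic(genotype_table(a, b))
        if stat > best[0]:
            best = (stat, (name_a, name_b))
    return best

# Hypothetical genotypes of nine individuals at three markers.
markers = {
    "m1": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "m2": [0, 0, 0, 1, 1, 1, 2, 2, 2],  # perfectly associated with m1
    "m3": [0, 1, 2, 0, 1, 2, 0, 1, 2],  # balanced against m1 and m2
}
max_stat, max_pair = max_pairwise_statistic(markers)
```

Because the maximum is taken over many dependent tests, its null distribution is what the adjusted p-value has to account for; the abstract addresses this with simulation or an analytic approximation.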

On the topological importance of enzymes and their phylogenetic frequency

Wei-Chung Liu (ISSAS).

A metabolic network is the sum of all chemical transformations in a cell, with metabolites interconnected by enzyme-catalyzed reactions. Heterogeneity is often observed among different enzymes, with some enzymes occurring in numerous bacterial species and some occurring in only a few. In this study, we wish to know whether the distribution of enzymes across different bacterial species is a random process. We then ask if there are relationships between the number of different bacterial species having a given enzyme and the enzyme's topological importance in the metabolic network. To do this we construct a complete enzyme network consisting of all currently known enzyme-enzyme relations for 288 bacterial species from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. We then calculate three different network indices measuring the topological importance of individual enzymes; these are degree, closeness centrality and betweenness centrality. Then we test whether these indices correlate with the phylogenetic frequency of enzymes. Based on our results, we further discuss the organisation and evolution of an enzyme network.
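
The three indices named above can be sketched on a toy undirected network as follows. The node names and edges are hypothetical, the graph is assumed to be unweighted, and betweenness is computed with Brandes' algorithm without normalization; none of this is taken from the study's own code.

```python
from collections import deque

def degree(graph):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness(graph):
    """(Reachable nodes) / (sum of shortest-path distances) per node."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack, dist = [], {s: 0}
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical enzyme network given as adjacency lists.
graph = {
    "E1": ["E2"],
    "E2": ["E1", "E3"],
    "E3": ["E2", "E4", "E5"],
    "E4": ["E3"],
    "E5": ["E3"],
}
deg, clo, btw = degree(graph), closeness(graph), betweenness(graph)
```

In this toy network E3 ranks highest on all three indices. Correlating such indices with the number of species carrying each enzyme would be the next step, which is not shown here.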

Statistical Methods for Quantitative Trait Loci Mapping.

Chen-Hung Kao (ISSAS).

Many biologically and economically important traits in higher organisms are quantitative, not qualitative. Traits such as hypertension, diabetes and some genetic diseases in humans; height, maturity, stress tolerance and yield of grain in crops; body fat percentage and weight gain in mice; and egg and milk production in animals are all examples of quantitative traits. Genes controlling quantitative traits are called quantitative trait loci (QTL). With an understanding of QTL, it is possible to diagnose human diseases at an early stage, and to breed or genetically engineer superior organisms to obtain desired characteristics such as increased yield and improved quality. In recent years, the advent of fine-scale genetic marker maps for various organisms through molecular biology techniques has greatly facilitated the systematic study of QTL. An introduction to QTL mapping, statistical methods for locating the positions and estimating the effects of QTL using genetic marker data, and the related statistical issues are presented and discussed. Simulated and real examples are used for illustration.

Matrix Visualization and Information Mining for High-Dimensional Bio-Medical Data Structure.

Chun-Houh Chen (ISSAS).

Most statistical techniques, particularly multivariate methodologies, focus on extracting information deposited in data and proximity matrices. Rather than relying solely on numerical characteristics, matrix visualization (MV) allows users to graphically explore structures embedded in given matrices. Visualization of gene expression profiles has made MV the most popular data visualization tool nowadays. This talk first discusses issues in the application of MV for exploring continuous data such as gene expression profiles using Generalized Association Plots (GAP), a package we have been developing for several years for general-purpose MV. Extensions to MV for non-continuous data (binary, nominal) and MV for data with additional information (covariate, cartographic link, longitudinal pattern, dependent structure) will then be discussed. MV for several real bio-medical studies will be illustrated in this talk with a possible software demonstration. Interested users are welcome to browse our web site and to try the GAP package for continuous/binary data:
http://gap.stat.sinica.edu.tw/

A Statistical Platform for Microarray Gene Expression Experiments and Some Studies in Genomic Statistics.

Chen-Hsin Chen (ISSAS).

The Genomic Statistics (GS) Unit has been working with the three other Units of the Advanced Bioinformatics Core (ABC) to collaborate in biomedical projects of the National Research Program for Genomic Medicine, Taiwan. We in the ABC-GS Unit first aimed at high-throughput microarray technology for large-scale gene-expression analysis in functional genomics, and have been developing the Gene Expression Study Design and Analysis Suite (GESDAS) as a statistical platform to facilitate extensive design and analysis of microarray experiments. Constructing a user-friendly web interface for the platform, we integrate relevant statistical packages and information visualization environments, some of which were developed by investigators in our Unit. The GESDAS functions include optimal experimental designs and sample-size determination for the study design, and quality assessments and analyses of both one-channel oligonucleotide arrays and two-channel cDNA arrays. Real datasets will be illustrated in the presentation.
Through upstream methodological collaboration on microarray images, statistical platform development, microarray data analysis, information mining and further to pathway studies, or selection of a novel internal control for Q-RT-PCR, we have successfully built up a critical mass in teamwork for applying bioinformatics research to gene expression profiling studies in cancer and SARS research.
In addition, we are interested in providing and enhancing statistical analyses for disease heterogeneity with genotypic, endophenotypic, phenotypic and clinical profiles. In the schizophrenia project, we investigated the transitions of patient subtypes between consecutive follow-up times or from acute disease state to subsided state, and gene-gene interactions in schizophrenia based on case-control or familial SNP studies. We have also developed a high-dimensional visualization tool in the mouse flow cytometry project. These rewarding collaborative experiences have become useful prototypes for us to promote service work.

Context-dependent Clustering for Dynamic Cellular State Modeling of Microarray Gene Expression.

Shinsheng Yuan (ISSAS).

Motivation:
High-throughput expression profiling allows researchers to study gene activities globally. Genes with similar expression profiles are likely to encode proteins that may participate in a common structural complex, metabolic pathway, or biological process. Many clustering, classification and dimension reduction approaches, powerful in elucidating the expression data, are based on this rationale. However, the converse of this common perception can be misleading. In fact, many biologically related genes turn out uncorrelated in expression.
Results:
In this paper, we present a novel method for investigating gene co-expression patterns. We assume the correlation between functionally related genes can be strengthened or weakened according to changes in some relevant, yet unknown, cellular states. We develop a context-dependent clustering (CDC) method to model the cellular state variable. We apply it to the transcription regulatory study for Saccharomyces cerevisiae, using the Stanford cell-cycle gene expression data. We investigate the co-expression patterns between transcription factors (TFs) and their target genes (TGs) predicted by the genome-wide location analysis of Harbison et al. (2004). Since a TF regulates the expression of its TGs, correlation between the expression profiles of the TF and its TGs can be expected. But as many authors have observed, the expression of transcription factors does not correlate well with the expression of their target genes. Instead of attributing the main reason to the lack of correlation between the transcript abundance and TF activity, we search for cellular conditions that would facilitate the TF-TG correlation. The results for sulfur amino acid pathway regulation by MET4, respiratory gene regulation by HAP4, and mitotic cell cycle regulation by ACE2/SWI5 are discussed in detail. Our method suggests a new way to understand the complex biological system from microarray data.

Analyzing data from Flow Cytometry Experiments.

Lung-An Li (ISSAS).

The data come from Flow Cytometry Standard Express of the Mouse Mutagenesis Program Core, Academia Sinica. The currently available commercial software, Flow Cytometry Standard Express, provides only 2-D views, whereas this analysis aims to provide 3-D or 4-D views for understanding how the measurements correlate with each other in high dimensions. The total cell count for each mouse is 300,000. We have successfully classified the cells into ten subtypes for wild-type mice, and are now able to detect potential mutant cellular phenotypes based on four biomarkers from our historical controls. We were also able to check whether the cells of these subtypes follow a multivariate normal distribution, and we found that they do.

A Five-Gene Signature and Clinical Outcome in Non-Small-Cell Lung Cancer.

Hsuan-Yu Chen (ISSAS).

Background:
Current staging methods are inadequate for predicting the outcome of treatment of non-small-cell lung cancer (NSCLC). We developed a five-gene signature that is closely associated with survival of patients with NSCLC.
Methods:
We used computer-generated random numbers to assign 185 frozen specimens for microarray analysis, real-time reverse-transcriptase polymerase chain reaction (RT-PCR) analysis, or both. We studied gene expression in frozen specimens of lung cancer tissue from 125 randomly selected patients who had undergone surgical resection of NSCLC and evaluated the association between the level of expression and survival. We used risk scores and decision-tree analysis to develop a gene-expression model for the prediction of the outcome of treatment of NSCLC. For validation, we used randomly assigned specimens from 60 other patients.
Results:
Sixteen genes that correlated with survival among patients with NSCLC were identified by analyzing microarray data and risk scores. We selected five genes (DUSP6, MMD, STAT1, ERBB3, and LCK) for RT-PCR and decision-tree analysis. The five-gene signature was an independent predictor of relapse-free and overall survival. We validated the model with data from an independent cohort of 60 patients with NSCLC and with a set of published microarray data from 86 patients with NSCLC.
Conclusions:
Our five-gene signature is closely associated with relapse-free and overall survival among patients with NSCLC.

A Pattern Recognition Approach to Infer Genetic Networks.

Cheng-Long Chuang (Institute of Biomedical Engineering, National Taiwan University, ISSAS),
Chung-Ming Chen (Institute of Biomedical Engineering, National Taiwan University) and
Grace S. Shieh (ISSAS)

Motivation: Inferring genetic interactions is of interest since it sheds light on important biochemical pathways. From a group of experimentally confirmed genetic interactions, we observed that paired gene expression curves of transcriptional compensatory (TC) interactions often were complementary (anti-similar) whereas those of transcriptional diminished (TD) interactions looked similar. This motivated us to develop a pattern recognition approach (called PARE) to infer genetic networks from time course microarray gene expression data (MGED).
Methods: PARE learns paired gene expression patterns from known genetic interactions, either confirmed by biological experiments or from published literature. Specifically, PARE extracts low order characteristics of the nonlinear paired curves, and integrates an optimization algorithm to train the decision score by MGED of known interactions. Subsequently, PARE can predict unknown gene interactions of similar nature.
Results: Utilizing yeast MGED in Spellman et al. (1998), PARE predicted 112 pairs of genetic interactions and 77 pairs of transcriptional interactions (TIs). Checked against qRT-PCR results and published literature, respectively, the modified true positive rates are 73% (70%) and 71% (69%) with n-fold (3-fold) cross validation, as compared to 52% and 56% for the latest advance in graphical Gaussian models. The false positive rates of predicting TC and TD interactions for gene pairs formed from the yeast genome (3052 synthetic sick or lethal gene pairs) are 5% and 11% (9% and 18%), respectively.

Boosting learning approach to association studies in bioinformatics.

Shinto Eguchi (ISM).

This talk gives an overview of various boosting learning algorithms for discovering associations between phenotypes and genomic/proteomic data including protein, gene expression and single nucleotide polymorphism data. The problem of high-dimensional data with a small sample size is commonly faced in analyzing observations from biotechnological experiments. Several approaches to such over-expressed data in boosting and other classification methods for tackling the problem are discussed.

Dimension reduction with positive definite kernels.

Kenji Fukumizu (ISM),
Francis R. Bach and
Michael I. Jordan.

We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from a formulation of SDR in terms of the conditional independence of the covariate X from the response Y, given the projection of X on the effective directions for regression (EDR). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an M-estimator for the EDR space. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.
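
The conditional independence formulation mentioned above can be written out as follows. The notation is introduced here for illustration and is an assumption, not taken from the abstract: X takes values in R^m, B is an m x d matrix with orthonormal columns spanning the EDR space, and the hatted operator denotes an empirical conditional covariance operator of Y given the projection of X.

```latex
% SDR as a conditional independence statement (B spans the EDR space)
\[
  Y \perp\!\!\!\perp X \mid B^{\top} X,
  \qquad B \in \mathbb{R}^{m \times d},\quad B^{\top} B = I_d .
\]
% One common way to turn this into an M-estimator: minimize the trace of
% an empirical conditional covariance operator over the projection B.
\[
  \hat{B} \;=\; \operatorname*{arg\,min}_{B:\; B^{\top} B = I_d}\;
  \operatorname{Tr}\!\left[\, \hat{\Sigma}^{\,B}_{YY \mid X} \right].
\]
```

The second display is a sketch of how such an estimator is usually stated in the kernel dimension reduction literature; the exact regularization and kernel conditions used in the talk are not given in the abstract.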

Biological Information fusion for estimating gene networks via Bayesian modeling and computation.

Tomoyuki Higuchi (ISM),
Seiya Imoto (Human Genome Center, Institute of Medical Science, University of Tokyo),
Rui Yamaguchi (Human Genome Center, Institute of Medical Science, University of Tokyo),
Ryo Yoshida (ISM) and
Satoru Miyano (Human Genome Center, Institute of Medical Science, University of Tokyo).

The Bayesian approach has matured and improved in several ways over the last two decades. In particular, improvements in computing speed have challenged us to compute Bayesian inference for more complicated models on larger datasets. Information fusion via the Bayesian approach is the process of fitting a probability model to a set of data and knowledge, and summarizing the result by a probability distribution on the parameters of the model. We would like to demonstrate our successful applications of the Bayesian approach to biological information fusion with DNA microarray gene expression data for nonlinear modeling of gene networks.
The first example is a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. The other example is a research project aimed at analyzing time-course DNA microarray gene expression data with a state space model (SSM). This model is extended and generalized to an SSM with Markov switching for estimating time-dependent gene network structure, because none of the dynamic models such as differential equations, SSM, and dynamic Bayesian networks can cope with time-dependency of the network structure. Finally, we would like to address several possibilities of applying the Bayesian information fusion method to estimate gene networks, and discuss a future direction in the application of computational methods to biological problems.

QTL Analysis With the Boosting Features for the Shape Variation of F2-Mice.

Haque Mollah (The Institute of Statistical Mathematics).

Genes controlling quantitative traits of organisms are termed quantitative trait loci (QTL). QTL analysis finds relationships between the phenotype and genotype of organisms. Most QTL analysis algorithms require univariate measurements of a phenotype. Quantitative traits such as hypertension, diabetes, height, body mass index (BMI), and egg and milk production in animals are examples of univariate phenotypes. However, the body shape, or the shape of a specific part of the body, of an organism depends on multivariate measurements. In the case of a multivariate phenotype, we consider each of three cases, (1) first principal components, (2) AdaBoost scores for classification and (3) the most influential feature for classification by AdaBoost, as the representative of the multivariate shape phenotype for QTL analysis. We also discuss the advantage of AdaBoost (supervised) over PCA (unsupervised) for QTL analysis using both mice mussel shape data and lower jaw shape data from the viewpoint of Mendelian phenotype distributions.

A unified method for detecting single feature polymorphisms and gene expression level differences.

Hironori Fujisawa (ISM),
Youko Horiuchi (National Institute of Genetics),
Yoshiaki Harushima (National Institute of Genetics),
Shinto Eguchi (ISM),
Takako Mochizuki (National Institute of Genetics),
Takayuki Sakaguchi (ISM) and
Nori Kurata (National Institute of Genetics).

Affymetrix GeneChip expression arrays are high-density oligonucleotide microarrays that were initially designed to monitor genome-wide expression profiles. Recently, the expression arrays have also been used for detecting thousands of nucleotide polymorphisms, which are called single feature polymorphisms (SFPs).
These two purposes have so far been pursued independently. However, if we inadequately estimate the expression level, then we will poorly detect the SFP, and vice versa. The two purposes are closely linked, so it is desirable to deal with them simultaneously.
For this purpose, we construct a statistical model on two kinds of microarray data and furthermore adopt a robust procedure for parameter estimation and testing because the SFP can be regarded as an outlier.
The proposed method was examined using mRNA hybridization data from two fully sequenced rice cultivars, the japonica rice "Nipponbare" and the indica rice "93-11", on the Affymetrix Rice Genome Array. We obtained very satisfactory results. For example, when the proposed method was applied to synthetic data mimicking the original data with high signal intensity, we could detect the SFPs with more than 90% sensitivity and less than 15% false positive rate.

Detecting Activated Transcription Factors from Gene Expression Data of Mice Treated with Kampo Medicine by Statistical Absolute Evaluation Method.

Rui Yamaguchi (University of Tokyo),
Masahiro Yamamoto (Keio University),
Seiya Imoto (University of Tokyo),
Masao Nagasaki (University of Tokyo),
Ryo Yoshida (ISM),
Kenji Tsuiji (Keio University),
Atsushi Ishige (Keio University),
Hiroaki Asou (Tokyo Metropolitan Institute of Gerontology),
Kenji Watanabe (Keio University) and
Satoru Miyano (University of Tokyo).

We propose an approach to identify activated transcription factors from gene expression data using a statistical test. Applying the method, we can obtain a synoptic map of transcription factor activities which helps us to easily grasp the system's behavior. As a real data analysis, we use case-control experiment data from mice treated with a Kampo medicine drug that remedies degraded myelin sheaths of nerves in the central nervous system. Kampo medicine is traditional Japanese herbal medicine. Since the drug is not a single chemical compound but an extract of multiple medicinal herbs, the effector sites are possibly multiple. Thus it is hard to understand the action mechanism and the system's behavior by investigating only a few highly expressed individual genes. Our method gives a summary of the system's behavior with various functional annotations, e.g. TFAs and gene ontology, and thus offers clues to understand it in a more holistic manner.

Genomic Data Assimilation for Inferring Gene Regulatory Networks from Gene Expression Profiles.

Ryo Yoshida (ISM).

Molecular regulatory networks of a living cell are comprised of a series of biochemical reactions, e.g. phosphorylation and binding of protein molecules, gene regulation by transcription factors, and the ubiquitin-proteasome system. In silico modeling of biological networks, based on biochemical rate equations, provides a rigorous tool for unraveling the complex machinery of molecular regulation. Currently, we are developing statistical technologies for data-driven construction of in silico network models. The method involves the following process: (1) model building with a Hybrid Functional Petri Net (HFPN) based on consensus knowledge of the biological system; (2) estimation of model parameters (biochemical rate constants) with a Bayesian regularizer; (3) model evaluation and selection; (4) remodeling where needed. We address these tasks based on generalized state space models. One main task of network profiling is to estimate the effective values of biochemical rate constants that are difficult to measure directly in vivo. To this end, we exploit time course measurements of gene expression. In silico network models usually describe regulatory mechanisms of protein-level activities that are not directly observed in gene expression measurements, thereby yielding ill-posedness in the system identification. To avoid such an ill-posed problem, we develop a Bayesian regularizer which incorporates external biological knowledge into the parameter estimation algorithm. Another important issue that we consider is the statistical evaluation and creation of hypothetical in silico models. We present a new Bayesian information-theoretic measure to evaluate the predictability and biological robustness of the constructed model. This results in the finding of inconsistencies between the constructed model and experiments, indicating aspects of the mechanism that require model revision.

Boosting ordinary class labels in statistical pattern recognition.

Mari Pritchard (The Institute of Statistical Mathematics).

We propose a boosting method for the ordinal data classification problem. The method uses a loss function for ordinal data classification named "all-threshold", which was introduced by Srebro and Rennie (2005). The "all-threshold" loss function imposes additional penalties which ensure that all the thresholds are ordered appropriately. In this poster we give the details of the algorithm and show experimental results.
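
The abstract does not spell out the loss, so the sketch below follows one common reading of the all-threshold construction: a real-valued score is compared against every threshold, and each threshold on the wrong side of the score adds a margin penalty. The hinge penalty, the function names, and the example thresholds are assumptions for illustration; the boosting algorithm of the poster is not reproduced.

```python
def hinge(margin):
    """Margin penalty: zero once the margin reaches 1 (assumed choice)."""
    return max(0.0, 1.0 - margin)

def all_threshold_loss(score, label, thresholds, penalty=hinge):
    """All-threshold loss for an ordinal label in 1..K.

    `thresholds` holds the K-1 increasing cut points. The score should lie
    above every threshold below the label and below every other threshold;
    each threshold contributes its own penalty term.
    """
    total = 0.0
    for level, theta in enumerate(thresholds, start=1):
        sign = -1.0 if level < label else 1.0
        total += penalty(sign * (theta - score))
    return total

# Hypothetical setup: K = 4 ordered classes, score 1.0 between the
# first and second thresholds.
thresholds = [0.0, 2.0, 4.0]
losses = {label: all_threshold_loss(1.0, label, thresholds)
          for label in (2, 3, 4)}
```

With these numbers the loss is zero when the true label is 2 and grows as the true label moves further from the interval containing the score, which is the behavior that distinguishes this loss from one that penalizes only the nearest threshold.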

Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models.

Osamu Hirose (University of Tokyo),
Ryo Yoshida (ISM),
Seiya Imoto (University of Tokyo),
Rui Yamaguchi (University of Tokyo),
Tomoyuki Higuchi (ISM),
D. Stephen Charnock-Jones (Cambridge University),
Cristin Print (University of Auckland) and
Satoru Miyano (University of Tokyo).

Statistical inference of gene networks by using time course microarray gene expression profiles is an essential step towards understanding the temporal structure of gene regulatory mechanisms. Unfortunately, most of the current studies have been limited to analysing a small number of genes because the length of time course gene expression profiles is fairly short. One promising approach to overcome such a limitation is to infer gene networks by exploring the potential transcriptional modules, which are sets of genes that share a common function or are involved in the same pathway. In this research, we present a novel approach based on the state space model to identify the transcriptional modules and module-based gene networks simultaneously. The state space model has the potential to infer large-scale gene networks, e.g. of order 10^3, from time course gene expression profiles. In particular, we succeeded in identifying a cell cycle system by using the gene expression profiles of Saccharomyces cerevisiae in which the length of the time course and number of genes were 24 and 4382, respectively. However, when analyzing shorter time course data, e.g. of length 10 or less, the parameter estimation of the state space model often fails due to overfitting. To extend the applicability of the state space model, we provide a way of using the technical replicates of gene expression profiles, which are often measured in duplicate or triplicate. The use of technical replicates is important for achieving highly efficient inference of gene networks with short time course data. The potential of the proposed method has been demonstrated through the time course analysis of simulation data and time course microarray data of human umbilical vein endothelial cells (HUVECs) undergoing growth factor deprivation-induced apoptosis.