Peer-Reviewed

Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study

Received: 5 December 2015     Accepted: 14 December 2015     Published: 30 December 2015
Abstract

Microarrays are a well-established technique for understanding cellular functions through transcriptomic profiling. Various analytical and statistical approaches have been developed to capture the overall features of high-dimensional microarray datasets. Agglomerative Hierarchical Clustering (AHC) is among the most widely used approaches to cluster analysis of gene expression data; however, little work has been done to compare the performance of AHC methods on such data: some authors compared three or four AHC methods, and others at most five. All of these authors recommended the complete linkage method to researchers seeking the best method for clustering their gene expression data. This paper compares the performance of seven AHC methods for clustering gene expression data under five major proximity measures, using the corrected Rand (cR) index to score each clustering method. We found that Ward's method performed best among the AHC methods and that the Cosine proximity measure outperformed the other measures on both Affymetrix and cDNA datasets.
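The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract does not name the seven AHC methods or the five proximity measures, so the `METHODS` and `METRICS` lists below are assumptions drawn from what SciPy offers (only Ward, complete linkage, and Cosine are mentioned in the text). The data are synthetic and hypothetical, and `corrected_rand` is a hand-written implementation of the corrected (adjusted) Rand index.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.special import comb

# Assumed lists: seven linkage rules and five distances available in SciPy.
METHODS = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]
METRICS = ["euclidean", "cityblock", "chebyshev", "correlation", "cosine"]


def corrected_rand(labels_true, labels_pred):
    """Corrected (adjusted) Rand index between two partitions of the same items."""
    _, true_idx = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pred_idx = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    # Contingency table: rows are true classes, columns are predicted clusters.
    table = np.zeros((true_idx.max() + 1, pred_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (true_idx, pred_idx), 1)
    sum_cells = comb(table, 2).sum()
    sum_rows = comb(table.sum(axis=1), 2).sum()
    sum_cols = comb(table.sum(axis=0), 2).sum()
    expected = sum_rows * sum_cols / comb(table.sum(), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))


def compare(data, labels_true, methods=METHODS, metrics=METRICS):
    """Score every (method, metric) pair by cutting the tree at the true class count."""
    n_clusters = len(np.unique(labels_true))
    scores = {}
    for metric in metrics:
        dist = pdist(data, metric=metric)  # condensed pairwise distance vector
        for method in methods:
            # Caveat: centroid, median and ward are only geometrically
            # meaningful for Euclidean distances; other metrics are run
            # here purely for illustration.
            tree = linkage(dist, method=method)
            labels_pred = fcluster(tree, t=n_clusters, criterion="maxclust")
            scores[(method, metric)] = corrected_rand(labels_true, labels_pred)
    return scores


# Hypothetical data: 30 samples x 50 genes drawn around three class profiles.
rng = np.random.default_rng(0)
centres = rng.normal(0.0, 3.0, size=(3, 50))
data = np.vstack([c + rng.normal(0.0, 1.0, size=(10, 50)) for c in centres])
labels_true = np.repeat([0, 1, 2], 10)

scores = compare(data, labels_true)
best = max(scores, key=scores.get)
print("best (method, metric) on the synthetic data:", best)
```

On real microarray data the rows would be samples (or genes) with known reference classes, and the ranking of methods may differ from what this toy example produces.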

Published in Computational Biology and Bioinformatics (Volume 3, Issue 6)
DOI 10.11648/j.cbb.20150306.12
Page(s) 88-94
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2015. Published by Science Publishing Group

Keywords

Agglomerative Hierarchical Clustering, Proximity Measures, Corrected Rand Index, Gene Expression Data

Cite This Article
  • APA Style

    Md. Bipul Hossen, Md. Siraj-Ud-Doulah, Aminul Hoque. (2015). Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study. Computational Biology and Bioinformatics, 3(6), 88-94. https://doi.org/10.11648/j.cbb.20150306.12


    ACS Style

    Md. Bipul Hossen; Md. Siraj-Ud-Doulah; Aminul Hoque. Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study. Comput. Biol. Bioinform. 2015, 3(6), 88-94. doi: 10.11648/j.cbb.20150306.12


    AMA Style

    Md. Bipul Hossen, Md. Siraj-Ud-Doulah, Aminul Hoque. Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study. Comput Biol Bioinform. 2015;3(6):88-94. doi: 10.11648/j.cbb.20150306.12


    @article{10.11648/j.cbb.20150306.12,
      author = {Md. Bipul Hossen and Md. Siraj-Ud-Doulah and Aminul Hoque},
      title = {Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study},
      journal = {Computational Biology and Bioinformatics},
      volume = {3},
      number = {6},
      pages = {88-94},
      doi = {10.11648/j.cbb.20150306.12},
      url = {https://doi.org/10.11648/j.cbb.20150306.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.cbb.20150306.12},
     year = {2015}
    }
    


    TY  - JOUR
    T1  - Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study
    AU  - Md. Bipul Hossen
    AU  - Md. Siraj-Ud-Doulah
    AU  - Aminul Hoque
    Y1  - 2015/12/30
    PY  - 2015
    N1  - https://doi.org/10.11648/j.cbb.20150306.12
    DO  - 10.11648/j.cbb.20150306.12
    T2  - Computational Biology and Bioinformatics
    JF  - Computational Biology and Bioinformatics
    JO  - Computational Biology and Bioinformatics
    SP  - 88
    EP  - 94
    PB  - Science Publishing Group
    SN  - 2330-8281
    UR  - https://doi.org/10.11648/j.cbb.20150306.12
    VL  - 3
    IS  - 6
    ER  - 


Author Information
  • Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

  • Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

  • Department of Statistics, Rajshahi University, Rajshahi, Bangladesh
