Combining Functional and Linkage Disequilibrium Information in the Selection of Tag SNPs

Sham P.C.^1,2,3,*, Ao, S.I.⁴, Kwan J.S.H.², Kao P.², Cheung F.², Fong P.Y.², Ng, M.K.⁵

(1) Department of Psychiatry, University of Hong Kong
(2) Genome Research Centre, University of Hong Kong
(3) SGDP Centre, Institute of Psychiatry, King's College London
(4) Department of Mathematics, University of Hong Kong
(5) Department of Mathematics, Hong Kong Baptist University

A method for tag-SNP selection is described that allows the user to specify variable weights for different genomic regions or SNPs. Tag-SNPs are selected such that a SNP with specified weight C will have a minimum R^2 of C with at least one tag-SNP. This flexible feature is useful for researchers who wish to prioritize genomic regions in an association study, and to optimize study design under budgetary constraints.

How does the tag SNP selection work
WCLUSTAG formulation
Experimental results
Download the program
How to use it
Acknowledgements
References
Supplementary information

How does the tag SNP selection work

There are two main approaches to selecting genetic markers in association studies of complex diseases. The first is a direct, functional, or gene-based approach, in which polymorphisms are selected if they cause a change in the amino acid sequence or expression of candidate genes. The second is an indirect approach in which markers in a region or the whole genome are systematically screened, on the basis that they may be in linkage disequilibrium (LD) with disease-related functional variants. For the second approach, efficiency can be improved by recognizing the redundancy between near-by markers through the presence of LD, and the selection of a subset of informative SNPs, called tag SNPs, for analysis (Johnson et al, 2001). A large number of methods for selecting tag-SNPs have been proposed (Halldόrsson et al, 2005).

In this report, we propose a modification of tag SNP selection algorithms that takes account of functional as well as LD information. More importance is attached to some SNPs than others, based on their positions within the coding or regulatory regions or splice sites. We also describe further modifications to address other practical issues: some SNPs may be more readily assayed than others under the proposed genotyping platform, and some SNPs may have been already genotyped in the sample. We have modified our recently described tag-SNP selection method (Ao et al, 2005) to incorporate functional and other information.

Minimax clustering

Our method for tag-SNP selection is based on agglomerative hierarchical clustering, which starts with a square matrix of pair-wise distances between the objects to be clustered. The two clusters with the smallest inter-cluster distance are successively merged until all the objects have been merged into a single cluster. For two SNPs, an appropriate measure of distance is 1-R^2, where R^2 is the squared correlation between the SNPs. Different forms of agglomerative clustering differ in the definition of the distance between two clusters, each of which may contain more than one object. We previously proposed the following definition for inter-cluster distance:
For each SNP belonging to either cluster, find the maximum distance (i.e. 1-R^2) from it to all the other SNPs in the two clusters
The smallest of these maximum distances is defined as the distance between the two clusters.
The corresponding SNP is defined as the tag-SNP of the newly merged cluster

In this method, called minimax clustering, setting a cutoff merging distance of C for terminating the algorithm would ensure that no SNP is further than C away from the tag-SNP in its cluster. This method was implemented in the CLUSTAG program. Also implemented in this program are two other tag SNP selection procedures, a complete linkage clustering method (Byng et al, 2003) and a set-cover algorithm similar to the method described in Carlson et al (2004). Our previous results show that complete linkage clustering results in a greater number of clusters and is therefore less efficient, while the set-cover method is similar to minima clustering in terms of the number of tag SNPs but produces less compact clusters (as measured by the average of the distances,1-R^2 between all SNPs and their assigned tag SNPs).

WCLUSTAG formulation

We have now extended this method to take into account other factors in the selection of tag SNPs. The basic idea is to allow of the cut-off distance C to be variable among markers, as specified by the user to reflect factors concerning the SNP; these might include positional and functional considerations, as well as more practical issues such as assay quality and whether include the SNP has been genotyped previously. For instance, C might be set at a high value (e.g. 0.8) for SNPs within the coding or regulatory regions of genes expressed in a certain tissue, and a low value (e.g. 0.4) for other SNPs. In a sense, more weight is given to some SNPs than others.

One complication of this modification is the asymmetry between two SNPs with different values of C. For example, if one SNP is coding and therefore given a C of 0.8, and the other is non-coding and therefore given a C of 0.4, and the R2 between the two SNPs is 0.6, then the first SNP is able to serve as tag-SNP for the second SNP, but not the reverse. The modification of the algorithm necessary to deal with this complication is simplified by the ability of the clustering program to handle an asymmetric distance matrix, in which the distance from object i to object j can differ from the distance from object j to object i. Because of this, the desired extension can be achieved by the following modifications to the clustering algorithm:
For each marker, a user-defined value of C is provided in an input file
The distance from marker i to marker j is defined as C[j]-R[ij] where C[j] is the value of C specified for marker j. If C[j]-R[ij]<0, then marker i can serve as a tag SNP for marker j.
This asymmetric distance matrix is subjected to the minimax clustering method with the cut-off merging distance set at 0. In order words, a cluster is formed if there is a tag SNP which has distance 0 or less with every member of the cluster.

Other modifications of this algorithm can be used to ensure the inclusion of SNPs that have been genotyped, and the exclusion of those that cannot be assayed. These modifications involve changing the values of the certain elements of the correlation matrix [R[ij]]. Thus, if marker t has been already genotyped, then all elements of column t of the matrix can be set to 0, except the diagonal element which remains 1. This setting ensures that marker t cannot be tagged by any marker except itself, so that it must be included as a tag SNP in the clustering algorithm. Similarly, if marker t is problematic for assay design, then all the elements of row t of the matrix can be set to 0. This setting ensures that marker t can never serve as a tag SNP. As the clustering proceeds, some of these SNPs will be merged into clusters, but others may remain untagged when the algorithm is terminated. At this stage it is necessary to check each of the remaining markers whether they have a potential tag SNP, and to include these in the list of tag SNPs.

And the details of how to handle the cases of some already-genotyped SNPs and some non-desirable SNPs are available here with MS Word format or here with pdf format.

The same principles of weighting can be applied also to the set-cover algorithm �V marker i can serve can tag SNP for marker j if the condition C[j]-R[ij]<0 is fulfilled. The algorithm would initially select all SNPs that have been already genotyped, and remove the markers tagged by these SNPs. Then the greedy algorithm proceeds as usual, except the exclusion of SNPs that have problems with assay design from the set of possible tag SNPs. As with the clustering algorithm, some "non-designable" SNPs may be left untagged at the end, and these are checked to see if they have potential tag SNPs.

Experimental results

We have made the necessary modifications to implement the weighting in all two tag-SNP selection algorithms in the WCLUSTAG program. To illustrate the performance of the new algorithms, they were applied to the CEPH sample genotype data from the International Haplotype Map Project. The ENCODE regions were selected because genotyping were undertaken for all known SNPs in these regions. Intragenic regions were identified from the start and end points of the coding sequences for the 33K Ensembl genes in NCBI build 34. Intragenic SNPs are given a C weighting of 0.8, and other SNPs 0.4. The compression ratios (the number of tag-SNPs over the total number of SNPs) of the various ENCODE regions are compared with the original procedure which used a uniform C value of 0.8. Our results show that there can be a further 35.2% saving with our weighted minimax algorithm, and 35.9% with the set cover method (Table 1). We also explored the impact of using different weighting schemes. Some additional saving can be obtaining by lowering the weights for either intragenic or other SNPs, although the compression ratios remain in the region of 0.2 (Table 2).

The method and program described here allow the user to prioritize different SNPs and genomic regions in a systematic association screen, depending on prior genomic and disease data available either through public databases or recent experiments. Any weight values below 1 can be specified for any SNP or region, allowing the user to explore different designs under a fixed budget. The utility of the program will be greatly enhanced when detailed SNP maps are available from the International Haplotype Map Project, and progressively more detailed functional information becomes available in public genome databases.

Table 1. Properties of the tag SNP selection algorithms, weighted with 0.8 for gene regions and 0.4 for other regions.

Encode Region (SNP No.)	Compression (Uniform)			Compression (Weighted)
Encode Region (SNP No.)	Complete	Minimax	Set Cover		Minimax	Set cover
2A (519)	0.277	0.245	0.247		0.104	0.104
2B (595)	0.291	0.255	0.261		0.197	0.198
4 (665)	0.242	0.211	0.209		0.089	0.089
7A (417)	0.314	0.281	0.281		0.149	0.139
7B (463)	0.186	0.166	0.171		0.114	0.114
7C (433)	0.240	0.217	0.215		0.189	0.185
8A (364)	0.269	0.245	0.245		0.190	0.190
9 (258)	0.360	0.318	0.314		0.221	0.225
12 (454)	0.260	0.227	0.227		0.167	0.163
18 (350)	0.283	0.254	0.254		0.186	0.189
Average	0,267	2.237	0.238		0.154	0.153
Additional saving	---	---	---		35.2%	35.9%

Table 2. Effect of weighting scheme (intragenic versus other SNPs) on the comparison ratios for tag-SNP selection algorithms in the Chromosome 9 Encode data.

Weighted Ratio	Compression
Weighted Ratio		Minimax	Set cover
0.8 : 0.4		0.221	0.225
0.8 : 0.3		0.198	0.198
0.8 : 0.5		0.240	0.244
0.7 : 0.4		0.217	0.221
0.9 : 0.4		0.240	0.244
0.8 : 0.8		0.318	0.314

Table 3. The number of SNPs in the intragenic regions and the other regions. The average ratio of the SNPs in the intragenic regions to the overall SNPs is 32.3%.

SNPs no.	SNPs in intragenic regions	SNPs in other regions
chr2A	0	519
chr2B	273	322
chr4	0	665
chr7A	21	396
chr7B	159	304
chr7C	299	134
chr8A	203	161
chr9	66	192
chr12	180	274
chr18	167	183

Download the program

The program WCLUSTAG is available for free download. .

Package v1 (slim version): click here for downloading the zipped file (package v1) of the WCLUSTAG with the sample files, the necessary directories, the DOS script and the C script for setting the weights of the SNPs, so that the user can run the program after unzipping this file. The C script change_chr9_gene.cpp (can be renamed to .c file if one like) can be complied with any C complier in Windows or Unix environment. The names of the input files for this script can be changed to any name one likes in the declaration section of the script. The default ones are the chr9_raw_processed.txt for the SNP position and MAF information, the chr9_LD_unweighted.txt for the LD correlation data, and the chr9_gene_processed.txt for the gene region and non-gene region information. The default setting for the weights are 0.8 for the gene region and 0.4 for the non-gene region. Users can change these values in the C script and then set the threshold of the WCLUSTAG script to the weight value of the gene region.

Package v2 (full version): click here for downloading the zipped file (package v2) of the WCLUSTAG with the sample files, the necessary directories, the DOS script for WCLUSTAG and the C script for setting the weights of the SNPs. There is also a sample all-in-one DOS script (execute_all.bat) of calling the WCLUSTAG script, handling different input file names and calling the CLUSTAG in one command. The differences between package v1 and package v2 is that, in package v2, the user can set different weights for each individual SNPs as they like. Also, there are the capacities for handling the cases with some SNPs data already done. Also, it can handle the case with some SNPs that are set as undersirable.

Sample program output for the minimax clustering are available here: the WCLUSTAG output.

Note: The java program TaggingSetChooser.jar here is based on our previous CLUSTAG program, but it is enhanced with the capability of handling the asymmetric correlation matrix.

How to use it

The sample DOS script (for the computer with the DOS environment like the MS Windows DOS command window) is available for download and the script for Mac and Linux for download. A sample C script for assigning the different weights to the intragenic regions and other regions can be downloaded here.

The path "data" and "results" can be changed. In case of no change, then the user should create these two folders and the "temp" folder under the directory containing the program file. The file names *.txt, *.members and *.out can also be set by the users. The parameters "threshold" and the "scale" can be set by the users. The "threshold" is for the threshold value used by the algorithms and the "scale" is for the scale of the html output.

The program WCLUSTAG (package v1) requires two input files, one for the LD relationship between each SNPs pair and the other for each SNP's position and MAF data. The data columns in the input files are separated by one space " ". And the LD data can be generated, for example, with the Haploview. The sample LD file and the sample MAF file (in ascending order of the SNP position) are contained in the above zipped file. The C script can assign different weights to the SNPs. And the source code of this script is provided so that the user can set different rules for deciding the weights.

There are more input files for the package v2. They are the ld.txt, maf.txt, weight.txt, done.txt, undesirable.txt, and SNP_names.data files. The done.txt file contains the names of the SNPs that have been tested in previous experiments, and the undesirable.txt file contains the names of the SNPs that are set as undesirable. Both of these files can be set to be empty if there are not such SNPs. The SNP_names.data contains the names of the SNPs that appears in the maf file in the ascending order of their position. The file can also be produced with the CLUSTAG, where it is produced under the "data" directory. The output file ld_processsed_temp.txt contains the information of the SNPs that are undersirable and that can (can not) be tagged by other SNPs. Also, it has the weighted correlation matrix displayed in a matrix format. The ld_processed.txt and maf_processed.txt are files that are to be used as the input files for the script execute_test.bat and they should be moved to the "data" directory. The variable max_ld in the C script refers to the maximum value of the weights of the SNPs. The threshold value in the execute_test.bat should be set to this value too. The C script can handle a total of 2000 SNPs in each run by default. This total number can be changed by replacing the matrix size variables 2000 in the declaration section to another values.

Acknowledgements

The ENCODE genotype data were downloaded from HAPMAP's site www.hapmap.org on Jun 30, 2004 and is based on NCBI build 34. The gene coordinates data were for Ensembl genes from NCBI build 34, downloaded from the UCSC Genome Browser http://genome-test.cse.ucsc.edu/cgi-bin/hgTables choosing human genome and July 1993 assembly. The work is supported by small project grant to P.C. Sham from The University of Hong Kong.

References

Ao, I., et al. (2005) CLUSTAG: Hierarchical Clustering and Graph Methods for Selecting Tag SNPs. Bioinformatics. In Press.

Barrett, J.C., et al. (2004) Haploview: Analysis and Visualization of LD and Haplotype Maps. Bioinformatics, Advance Access.

Byng, M., et al. (2003) SNP Subset Selection for Genetic Association Studies. Annals. Of Human Genetics, 67, 543-556.

Carlson, C., et al. (2004) Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium. Am. J. Hum. Genet. 74:106-120.

Johnson, G., et al. (2001) Haplotype Tagging for the Identification of Common Disease Genes. Nat Genet 29(2):233-7

Supplementary Information

Note on Clustering Methods: Two more papers of the clustering methods and their comparison for the gene expression data are listed below:

Mcshane L.M., Radmacher M.D., Freidlin B., Yu R., Li M.C., and Simon R. (2002) Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics. v11, 1462-9.

Datta S. and Data S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics. v19, 459-466.

Note on the minimax algorithm: It is pointed out that there exists a parallel from topology with the minimax algorithm. It is the Hausdorff distance, which measures the maximum distance of a set to the nearest point in the other set. More details can be found at:

Rucklidge W. (1996) Efficient visual recognition using the Hausdorff distance. Springer.

Note on Run Time: The complexity of the clustering methods are of order O(n²). With the run time information in our table of several hundred SNPs and this complexity information, the users can estimate roughly the expected run time for their samples before the program's execution. The run time will not be an issue for data of several hundred to a hundred thousand SNPs. But, it will be a constraint when we are studying the whole genome at one time, when the size may be of several million SNPs. This is an area of further work as the HAPMAP project is producing the whole genome haplotype information.

Note on Graph-Method Algorithm: Two more papers from different journals are cited, which is about the graph-method algorithm. The one by Johnson (1973) studied the error bound of the algorithm and the one by Fujito (2001) studied the case of weighted edges.

Johnson D. (1973) Approximation Algorithms for Combinatorial Problems. Annual ACM Symposium on Theory of Computing. 38-49.

Fujito T. (2001) On approximability of the independent/connected edge dominating set problems. Information Processing Letters. v79, 261-266.

Note on the CLUSTAG: The result is published on:

Ao, S.I., Kevin Yip, Michael K. Ng, David Cheung, Pui-Yee Fong, Ian Melhado and Pak C Sham (2005) CLUSTAG: Hierarchical Clustering and Graph Methods for Selecting Tag SNPs. Bioinformatics. v21(8), 1735-1736.

Note on the C script of WCLUSTAG: The script in our download section will set the new correlation c'[i] to C[max] - ( C[j]-R[ij] ), where C[max] is the maximum of the C's in the chromosome region. And, the threshold in the WCLUSTAG script is set at C[max] too.

Note on the Clustering Methods and Result Comparison: Methods for tag SNP selection based on established multivariate statistical techniques may offer some advantages. Byng et al. (2003) proposed the use of single and complete linkage hierarchical cluster analysis to select tag SNPs. Hierarchical clustering starts with a square matrix of pairwise distances between the objects to be clustered. Our method in the Clustag has been compared with an algorithm based on the NP-complete minimum dominating set of the set-cover problem, similar to the greedy algorithm developed by Carlson et al. (2004) in the previous study Ao et al. (2005).

*To whom correspondence should be addressed.