Bioinformatics

Lab 7 - Bioinformatics

Introduction

In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes and proteomes of several species. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

The most basic tasks in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) make up the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing an interface through which researchers can both access existing information and submit new entries is only the beginning. Other tasks associated with the field of bioinformatics include:

	Finding the genes in the DNA sequences of various organisms.
	Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
	Clustering protein sequences into families of related sequences and the development of protein models.
	Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.

The process of evolution has produced DNA sequences that encode proteins with very specific functions. It is possible to predict the three-dimensional structure of a protein using algorithms derived from our knowledge of physics and chemistry and, most importantly, from the analysis of other proteins with similar amino acid sequences. Today’s lab is designed as a simple introduction to the field of bioinformatics, using some of the online resources that relate to the human genome.

We will be using the National Center For Biotechnology Information’s (NCBI) map viewer of the human genome. Using the NCBI map viewer, we can view two broad categories of maps: genetic maps and physical maps. Both genetic and physical maps provide the likely order of items along a chromosome. However, a genetic map, like an interstate highway map, provides an indirect estimate of the distance between two items and is limited to ordering certain items. One could say that genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. On the other hand, physical maps mark an estimate of the true distance, in measurements of base pairs, between items of interest.

To continue our analogy, physical maps would then be similar to street maps, where the distance between two sites of interest may be defined more precisely in terms of city blocks or street addresses. Physical maps, therefore, allow a scientist to more easily home in on the location of a gene. An appreciation of how each of these maps is constructed may be helpful in understanding how scientists use these maps to study the "human genome".

Just as interstate maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as a landmark in a map if, in most cases, that stretch of DNA is passed from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic such as eye color, or a less noticeable trait such as a disease. DNA polymorphisms can also serve as molecular markers. These types of markers are typically found within the non-coding regions of genes and are used to detect unique regions on a chromosome. DNA markers are especially useful for generating genetic maps when occasional, predictable mutations that occur during meiosis lead, over many generations, to a high degree of variability in the DNA content of the marker from individual to individual.

Early geneticists recognized that genes are located on chromosomes and believed that each individual chromosome was inherited as an intact unit; that is, linked genes did not assort independently during gamete formation. They hypothesized that if two genes were located on the same chromosome, they were physically linked together and were inherited together. We now know that this is not always the case. Studies conducted around 1910 demonstrated that very few pairs of genes are located close enough together to display complete linkage. Pairs of genes were either inherited independently or displayed partial linkage: they were inherited together sometimes, but not always, because of recombination during meiosis.

As we have learned, if recombination occurs as a random event, then two markers that are close together should be separated less frequently than two markers that are more distant from one another. The recombination probability (frequency of recombination) between two markers, which can range from 0 to 0.5, increases as the distance between the two markers increases along a chromosome. Therefore, the recombination probability may be used as a surrogate for ordering genetic markers along a chromosome. If you then determine the recombination frequencies for different pairs of markers, you can construct a map of their relative positions on the chromosome. We have just finished such an exercise with our three-point testcross with Drosophila.
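As a rough illustration of how recombination frequencies can order markers, the short Python sketch below computes pairwise frequencies for three markers and places the pair with the largest frequency on the outside. The marker names (A, B, C) and the offspring counts are invented for illustration; they are not data from the Drosophila lab.

```python
def recombination_frequency(recombinants, total):
    """Fraction of offspring that are recombinant for a pair of markers."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    return recombinants / total


def order_three_markers(rf):
    """Order three markers from their pairwise recombination frequencies.

    rf maps frozenset({marker1, marker2}) -> recombination frequency.
    The pair with the largest frequency is assumed to be the outer pair,
    so the remaining marker lies between them.
    """
    outer = max(rf, key=rf.get)
    markers = set().union(*rf)
    middle = (markers - outer).pop()
    left, right = sorted(outer)
    return (left, middle, right)


# Hypothetical counts of recombinant offspring out of 1000 scored.
TOTAL = 1000
counts = {
    frozenset({"A", "B"}): 100,
    frozenset({"B", "C"}): 150,
    frozenset({"A", "C"}): 230,
}
rf = {pair: recombination_frequency(n, TOTAL) for pair, n in counts.items()}
order = order_three_markers(rf)
print(order)
```

In this made-up data set the A-C frequency (0.23) is smaller than the sum of the A-B and B-C frequencies (0.10 + 0.15), which is what happens when double crossovers go undetected between the outer markers.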

But predicting recombination is not so simple as our lab with Drosophila where we used easily observable phenotypic traits. Although crossovers are random, they are not uniformly distributed across the genome or any chromosome. Some chromosomal regions, called recombination hotspots, are more likely to be involved in crossovers than other regions of a chromosome. This means that genetic map distance does not always indicate physical distance between markers.

Despite these qualifications, linkage analysis usually correctly deduces marker order, and distance estimates are sufficient to generate genetic maps that can serve as a valuable framework for genome sequencing. In humans, data for calculating recombination frequencies are obtained by examining the genetic makeup of the members of successive generations of existing families using pedigree analysis. Linkage studies often begin by obtaining blood samples from a group of related individuals. For relatively rare diseases, scientists find a few large families that have many cases of the disease and obtain samples from as many family members as possible. For more common diseases where the pattern of disease inheritance is unclear, scientists will identify a large number of affected families and will take samples from four to thirty close relatives.

DNA is then harvested from all of the blood samples and is screened for the presence, or co-inheritance, of two markers. One marker is usually the gene of interest, generally associated with a physically identifiable characteristic. The other is usually one of the various detectable rearrangements. A computerized analysis is then performed to determine whether the two markers are linked and approximately how far apart those markers are from one another. In this case, the value of the genetic map is that an inherited disease can be located on the map by following the inheritance of a DNA marker (polymorphism) present in affected individuals but absent in unaffected individuals, although the molecular basis of the disease may not yet be understood, nor the gene(s) responsible identified.
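The handout does not say which statistic the computerized analysis uses. One widely used choice is the LOD score, which compares the likelihood of the data at a proposed recombination fraction with the likelihood under no linkage (a recombination fraction of 0.5). The sketch below assumes the simplest case, in which every meiosis can be scored as recombinant or non-recombinant; the counts are invented for illustration.

```python
import math


def lod_score(recombinants, total, theta):
    """LOD score for a recombination fraction theta (0 < theta <= 0.5).

    Assumes each of the `total` meioses was scored unambiguously as
    recombinant or non-recombinant (a simplifying assumption).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = total - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = total * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical family data: 2 recombinants among 20 scored meioses.
RECOMBINANTS, TOTAL = 2, 20

# Try recombination fractions from 0.01 to 0.50 and keep the best one.
thetas = [i / 100 for i in range(1, 51)]
best_theta = max(thetas, key=lambda t: lod_score(RECOMBINANTS, TOTAL, t))
best_lod = lod_score(RECOMBINANTS, TOTAL, best_theta)
print(best_theta, round(best_lod, 2))
```

By convention, a LOD score of 3 or more is usually taken as evidence that two markers are linked, and the recombination fraction that gives the highest score serves as the estimate of the distance between them.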

Genetic maps are also used to generate the essential backbone, or scaffold, needed for the creation of more detailed human genome maps. These detailed maps, called physical maps, further define the DNA sequence between genetic markers and are essential to the rapid identification of genes. Physical maps can be divided into three general types:

1. chromosomal or cytogenetic maps

2. radiation hybrid (RH) maps

3. sequence maps.

The different types of maps vary in their degree of resolution, that is, the ability to measure the separation of elements that are close together. The higher the resolution, the better the picture. The lowest-resolution physical map is the chromosomal or cytogenetic map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. This is the location you will be identifying and entering into the table at the end of the lab writeup. As with genetic linkage mapping, chromosomal mapping can be used to locate genetic markers defined by traits observable only in whole organisms. Because chromosomal maps are based on estimates of physical distance, they are considered to be physical maps. Yet, the number of base pairs within a band can only be estimated. We will not be using RH maps or sequence maps in today’s lab, but information on them is provided in the appendix.

Comparing the many available genetic and physical maps can be a time-consuming step, especially when trying to pinpoint the location of a new gene. Without the use of computers and special software designed to align the various maps, matching a sequence to a region of a chromosome that corresponds to the gene location would be very difficult. It would be like trying to compare 20 different interstate and street maps to get from a house in Ukiah, California to a house in Beaver Dam, Wisconsin. You could compare the maps yourself and create your own travel itinerary, but it would probably take a long time. Wouldn't it be easier and faster to have the automobile club create an integrated map for you? That is the goal behind NCBI's Human Genome Entrez Map Viewer.

The Entrez Map Viewer provides a graphical display of the available human genome sequence data as well as sequence, cytogenetic, genetic linkage, and RH maps. Map Viewer can simultaneously display up to seven maps, selected from a set of 19, and allows the user access to detailed information for a selected map region. Map Viewer uses a common sequence numbering system to align sequence maps and shared markers as well as gene names to align other maps. You can use NCBI's Map Viewer to search for a gene in a number of genomes, and more genomes are being added regularly. Displays are provided at four levels of detail:

A. Home Page for an organism - summarizes the resources available for that organism

B. Genome View - graphically displays the complete genome as a set of chromosome ideograms (to scale) and allows you to search for terms across the genome, showing the location of the hits on the chromosome ideograms

C. Map View - presents one or more maps of interest for a selected chromosome, aligned to a Master Map that you select, and allows you to view regions of interest at different levels of resolution

D. Sequence View - displays the sequence data for a specific chromosomal region, and graphically depicts the biological features that have been annotated on that region

Procedure:

You will explore the human genome using the NCBI Entrez viewer and fill in the attached table. To do this you need to go to an overview of the human genome at this website: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi. You should bookmark the website in case you get lost and need to get back to this page quickly.

There are a number of ways to find the location of the genes of interest. One way is to type the name of the gene you are interested in under “search for” and then click “find”. This should identify the chromosome that carries the gene. Then return to the previous page and click on the chromosome you identified (this page has the information you need). You should include the following information in the tables at the end of this lab:

1. The chromosome number in which the gene of interest is found

2. The total number of genes found on the chromosome

3. The total size of the chromosome in megabases (Mb)

4. The position of the centromere on the chromosome in which the gene is found, recorded with one of the following letters:

“A” Acrocentric

“M” Metacentric

“S” Submetacentric

“T” Telocentric

5. The symbol for the gene of interest (this is provided on the chart)

6. The cytological map location of the gene

7. Several sentences describing the function of the gene of interest
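Cytological locations (item 6) are usually written in a compact form such as 7q31.2: the chromosome, then the arm (p for the short arm, q for the long arm), then the band. The Python sketch below splits such a string into its parts. The function name and the example strings are hypothetical, and the pattern handles only a single band, not a range such as 3q21-q25.

```python
import re
from typing import NamedTuple


class CytoLocation(NamedTuple):
    chromosome: str  # "1" to "22", "X" or "Y"
    arm: str         # "p" (short arm) or "q" (long arm)
    band: str        # for example "31.2"


# chromosome, then arm, then band with an optional sub-band
_PATTERN = re.compile(r"(\d{1,2}|X|Y)([pq])(\d+(?:\.\d+)?)")


def parse_cytoband(text):
    """Split a single-band location such as '7q31.2' into its parts."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a single-band location: {text!r}")
    return CytoLocation(*match.groups())


location = parse_cytoband("7q31.2")
print(location.chromosome, location.arm, location.band)
```

Reading the parts separately makes it easy to check that the chromosome recorded for item 1 agrees with the cytological location recorded for item 6.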

Appendix: higher-resolution genome mapping (optional)

RH maps and sequence maps are more detailed than chromosomal maps. RH maps are similar to linkage maps in that they show estimates of the distance between genetic and physical markers, but that is where the similarity ends. RH maps provide more precise information about the distance between markers than a linkage map can. The physical map that provides the most detail is the sequence map. Sequence maps show genetic markers, as well as the sequence between the markers, measured in base pairs.

RH mapping, like linkage mapping, shows an estimated distance between genetic markers. But, rather than relying on natural recombination to separate two markers, scientists use breaks induced by radiation to determine the distance between two markers. In RH mapping, a scientist exposes DNA to measured doses of radiation, and in doing so, controls the average distance between breaks in a chromosome. By varying the degree of radiation exposure to the DNA, a scientist can induce breaks between two markers that are very close together. The ability to separate closely linked markers allows scientists to produce more detailed maps. RH mapping provides a way to localize almost any genetic marker, as well as other genomic fragments, to a defined map position, and RH maps are extremely useful for ordering markers in regions where highly polymorphic genetic markers are scarce. Scientists also use RH maps as a bridge between linkage maps and sequence maps. In doing so, they have been able to more easily identify the location(s) of genes involved in diseases such as spinal muscular atrophy and hyperekplexia, more commonly known as "startle disease".

Sequence-tagged site (STS) mapping is another physical mapping technique. An STS is a short DNA sequence that has been shown to be unique. To qualify as an STS, the exact location and order of the bases of the sequence must be known, and the sequence must occur only once in the chromosome being studied, or in the genome as a whole if the DNA fragment set covers the entire genome.

To map a set of STSs, a collection of overlapping DNA fragments from a chromosome is digested into smaller fragments with restriction enzymes. The data from which the map will be derived are then obtained by noting which fragments contain which STSs. To accomplish this, scientists copy the DNA fragments using a technique called molecular cloning. First, the fragments are inserted into a vector, such as a plasmid. After introduction into a suitable host, the DNA fragments can then be reproduced along with the host cell DNA, providing unlimited material for experimental study. An unordered set of cloned DNA fragments is called a library.

Next, the clones, or copies, are assembled in the order they would be found in the original chromosome by determining which clones contain overlapping DNA fragments. This assembly of overlapping clones is called a clone contig. Once the order of the clones in a chromosome is known, the clones are placed in frozen storage, and the information about the order of the clones is stored in a computer, providing a valuable resource that may be used for further studies. These data are then used as the base material for generating a lengthy, continuous DNA sequence, and the STSs serve to anchor the sequence onto a physical map.
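The first step of this assembly, deciding which clones overlap, can be sketched in a few lines of Python. The example below treats two clones as overlapping when they share at least one STS and groups connected clones into contigs. The clone and STS names are invented, and the sketch only groups clones; it does not attempt to work out their order within a contig.

```python
def group_clones_into_contigs(clone_sts):
    """Group clones that are linked by shared STSs.

    clone_sts maps a clone name to the set of STSs detected in it.
    Returns a sorted list of contigs, each a sorted list of clone names.
    """
    unvisited = set(clone_sts)
    contigs = []
    while unvisited:
        start = min(unvisited)  # pick a clone deterministically
        unvisited.remove(start)
        stack = [start]
        contig = []
        while stack:
            clone = stack.pop()
            contig.append(clone)
            # Clones sharing at least one STS with this clone overlap it.
            overlapping = [other for other in unvisited
                           if clone_sts[other] & clone_sts[clone]]
            for other in overlapping:
                unvisited.remove(other)
                stack.append(other)
        contigs.append(sorted(contig))
    return sorted(contigs)


# Hypothetical STS content of five clones from a library.
library = {
    "cloneA": {"sts1", "sts2"},
    "cloneB": {"sts2", "sts3"},
    "cloneC": {"sts3", "sts4"},
    "cloneD": {"sts7"},
    "cloneE": {"sts7", "sts8"},
}
contigs = group_clones_into_contigs(library)
print(contigs)
```

Here cloneA and cloneC share no STS directly, but both overlap cloneB, so all three fall into one contig. No STS connects that contig to cloneD and cloneE, which corresponds to a gap in clone coverage.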

As with most complex techniques, STS-based mapping has its limitations. In addition to gaps in clone coverage, DNA fragments may become lost or mistakenly mapped to a wrong position. These errors may occur for a variety of reasons. A DNA fragment may break, resulting in an STS that maps to a different position. DNA fragments may also get deleted from a clone during the replication process, resulting in the absence of an STS that should be present. Sometimes a clone composed of DNA fragments from two distinct genomic regions is replicated, leading to DNA segments that are widely separated in the genome being mistakenly mapped to adjacent positions. Lastly, a DNA fragment may become contaminated with host genetic material, once again leading to an STS that will map to the wrong location.

To help overcome these problems, as well as to improve overall mapping accuracy, researchers have begun comparing and integrating STS-based physical maps with genetic, RH and cytogenetic maps. Cross-referencing different genomic maps enhances the utility of a given map, confirms STS order, and helps order and orient evolving contigs.

References: Most of this material is from several online databases:

1. http://biotech.icmb.utexas.edu/pages/bioinfo.html

2. http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi

3. http://www.ncbi.nlm.nih.gov/

Chromosome Number	Number of Genes	Chromosome Size (Mb)	Location of Centromere	Symbol of Gene	Location of Gene	Full Name and Function of Gene

				AGTR1
				KIF11
				CCL14
				PTPRU
				ODF1
				LAMA5
				IL1R1
				COL4A2

Chromosome Number	Number of Genes	Chromosome Size (Mb)	Location of Centromere	Symbol of Gene	Location of Gene	Full Name and Function of Gene

				RPL12
				IDH3A
				TIMP3
				OCLN
				TAF1C
				ARSF
				MATK
				HIST1H1A

Chromosome Number	Number of Genes	Chromosome Size (Mb)	Location of Centromere	Symbol of Gene	Location of Gene	Full Name and Function of Gene

				SMURF1
				TDG
				DNAM-1
				AADAT
				TFF1
				TRDV1
				SRY
				FBXO3