How to use the G1E Microarray Database

The G1E Database, G1EDb, contains the experimental results from a microarray-based analysis of differentiating G1E erythroid cells. Experimental results are also available on this site in other formats, and have been deposited with the NIH in the GEO database. G1EDb is particularly useful, however, for reviewing the expression pattern of either individual genes or groups of genes sorted by functional, numerical or other criteria. The database was designed to make it easy to get at the underlying information in an intuitive manner.

Overview

All of the functions of the G1EDb are accessed from one screen. The screen shows the transcriptional profile for one probeset of the 12,488 present on the MG-U74Av2 microarray. This information is conveyed at a glance through the use of graphic elements, but detailed information is also present. The user can page through the database probeset by probeset or search for a specific probeset using defined criteria. Subsets of data can be created using search parameters, or by manually picking probesets one at a time. Finally, data can be exported from the database in a number of formats. A number of hyperlinks connect each page to relevant web resources (assuming the user's computer has a live internet connection). A picture of the screen is shown below and is divided into several thematic regions, which are described in additional detail in the following text. For a larger picture of the screen shot, click on the picture.
[Figure: small picture of the main G1EDb screen, with a legend dividing the screen into functional subsections]
1. Identifying information
2. Data Synopsis (absolute and comparative analysis)
3. Annotation (Source, Function, Interest Level) and Comments
4. Graphic Profile: Absolute (top) and Relative (bottom)
5. Filter Parameters (Source, Function, Numeric, Qualitative)
6. Search Dialogue & Manual Marking of Records
7. Output Options (HTML, Excel, Summary or Detailed Reports)

Running the Database

The database is distributed in Microsoft Access format, so a copy of MS Access is required to use it (MS Access cannot be distributed with the database). The database does not run remotely over the web, so it is necessary to download a copy from our website. Once the database file is present on your computer, double click its icon. The database should automatically show the first probeset on the array, similar to the picture above. If the form is clipped by the edges of the screen, either drag the edges of the window to enlarge it, or hit the maximize button in the upper right hand corner of the window. If the form does not fit because of screen resolution limitations, increase your monitor's resolution using the control panel or by right clicking on the desktop. Some institutions limit your ability to modify desktop settings; if so, either ask your IS department to change your resolution or use the scrollbars to navigate around the database.

Identifying information

Identifying information is found at the top of the screen; fields correspond to probeset, Genbank accession number, Unigene cluster assignment, LocusLink ID, gene symbol, and definitions of the genes according to both Genbank and Unigene. Two buttons labeled NetAffx and Blast are also present.

The most annoying problem in bioinformatics is that the underlying information is in a constant state of flux. The database is designed around the one invariant element of this experiment: the probeset ID number provided by Affymetrix. Since each probeset is based on a Genbank entry, those Genbank accessions were used to extract information for the database. Perl scripts were written to map Genbank sequences to the U96Mm version of Unigene. The database is modular in design, and at some point when these mappings become sufficiently antiquated, this process can be repeated, substituting newer definitions. There are two titles for each gene because one set was extracted directly from Genbank, the other from Unigene. Unigene cluster assignment, LocusLink ID, and tissue specificity were extracted from Unigene entries. Some sequences were not associated with Genbank accession numbers or did not map to Unigene, resulting in NO_GENE and NO_UG entries in the database.
Next to various fields, there are buttons. These are hotlinks to relevant NCBI and Affymetrix resources. Clicking on LocusLink will bring you to the LocusLink page assigned to the gene you are viewing. Clicking on Gene will retrieve all entries with that gene symbol, which can take a while to download -- usually this is not the button you want. The Probeset button will attempt to access probeset data from Affymetrix. Presently, retrieval of information from Affymetrix requires an account from Affymetrix, and login information must be entered at their site. Additionally, there are two other buttons: NetAffx and Blast. NetAffx brings you to the NetAffx site at Affymetrix, while Blast launches a Blastn search of the corresponding Genbank sequence. This search is performed at the NCBI and the results are returned in a window. The expected behavior of these buttons is that they will open new windows in a web browser. Hitting the browser's back button should return you to the database and close the window.
Since the probeset ID and genbank accession number should not change, you cannot write in those fields. The other fields may be changed by the user, but be careful. Once you leave the record, those changes become part of the database. It might make sense to make a copy of the database when you download it, just in case something like that happens.

Data Synopsis

The data synopsis provides a numerical look at the transcript and gives an indication of how well the three experimental replicates agreed. All of these values are derived from analysis with Affymetrix MAS 5.0 software. The synopsis is divided into three tiers.

Absolute

The top tier is the absolute signal value, an average of the three replicates. The letters under these boxes are the "Presence Calls" for each experiment, where "P" indicates Present, "A" indicates Absent and "M" indicates Marginal.

Comparative versus Zero

Down a level is the comparative analysis versus time zero. The values are signal log2 ratios of each time point (3 to 30 hours) versus time zero. The value is derived from the Affymetrix MAS 5.0 software comparative analysis, which compares an experimental array to a baseline array from the same experiment in a probe-by-probe manner. This yields a ratio, which is then transformed to a log scale. Since there are three replicates, to summarize the fold change, these logarithmic values are linearized, averaged, and then converted back into log2. This is the value in the box. Beneath each time point is a graphic representation of the "Change Call" generated by Affymetrix. Here is the key:
Decreased
Marginally Decreased
No Change
Marginally Increased
Increased
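The replicate averaging described above (linearize, average, convert back to log2) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used to build the database; the function name and the example values are invented.

```python
import math

def summarize_slr(replicate_slrs):
    """Combine replicate signal log2 ratios into one value.

    Each log2 ratio is converted to a linear fold change, the fold
    changes are averaged, and the mean is converted back to log2.
    """
    fold_changes = [2 ** slr for slr in replicate_slrs]
    mean_fold = sum(fold_changes) / len(fold_changes)
    return math.log2(mean_fold)

# Invented replicate values for one time point versus time zero.
replicates = [1.0, 2.0, 3.0]
print(round(summarize_slr(replicates), 3))
```

Note that averaging on the linear scale gives more weight to the larger ratios, so the result is not the same as the plain mean of the log2 values (2.0 for the invented values above).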

Two-point Comparative

Accelerations in transcripts may be discerned by comparing one time point to an adjacent point. In this analysis, each time point is compared to the previous time point, which serves as a baseline (so the first column is 3 hours versus 0 hours, the next is 7 hours versus 3 hours, etc.). Again, comparisons were made within individual experiments, and the same conventions were used as above. Below the change calls, two boxes show the greatest changes for the transcript in either the comparison versus time zero or between two points. This is a quick way of looking for genes that change a lot. Next to these numbers is an indicator which will read "Always Present", "Always Absent" or "variable", based on the presence/absence calls. Transcripts which are absent in all 18 arrays can be quickly identified by the dark background behind "Always Absent".

Annotations

The initial annotations were derived automatically from a number of databases including Unigene, the GO database and the Mouse Genome Database. Additionally, many transcripts were manually reviewed and categorized. The field at the bottom contains comments pasted in from literature or database searches. An "importance" rating is also present; this allows the user to subjectively designate a probeset as "a winner", "interesting", "maybe" or "forget it". The "Murine" checkbox is present because not every probeset on the array is murine -- for instance, there are several control probesets on the array such as Cre. Finally, beneath the annotation is the record number. There are a total of 12,488 probesets on the array, each with a unique record number. If the dataset is filtered in some way, this area will state the number of records which passed the filter and can be viewed. If you are not seeing all of the transcripts, a filter may have restricted the dataset. To revert to an unfiltered view, hit the "Reset All" button in the upper-right hand panel.

Graphic Profile

Two graphs are generated for each probeset, a plot of absolute signal (in red) and a relativized plot (in green). The absolute signal plot is just another way of viewing the absolute signal numbers shown on the left side of the form. To generate the relativized plot, the highest value is set to 100% and the lowest is defined as 0%; intermediate values are scaled. The intention of this plot was to allow comparison of profiles with different absolute signals, but the same overall shape. Different probe affinities/hybridization can result in variable signal levels. Using one scale for all transcripts would make it hard to see low signals. This plot is misleading when a signal does not change much, because it magnifies the scale. If these graphs are selected and copied to the clipboard in Access, they can be pasted into other applications.
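The scaling behind the relativized plot can be sketched as follows. This is a minimal Python illustration of the rule stated above (highest value set to 100%, lowest to 0%), not the code inside the Access form; the function name, the handling of a completely flat profile, and the signal values are assumptions made for the example.

```python
def relativize(signals):
    """Scale a signal profile so its minimum is 0% and its maximum is 100%."""
    lo, hi = min(signals), max(signals)
    if hi == lo:
        # Assumption: a perfectly flat profile is drawn at 0% throughout.
        return [0.0 for _ in signals]
    return [100.0 * (s - lo) / (hi - lo) for s in signals]

# Invented profiles: same shape, tenfold different absolute signal.
weak = [100, 200, 400]
strong = [1000, 2000, 4000]
print(relativize(weak))
print(relativize(strong))

# A nearly flat profile is stretched to fill the whole scale.
print(relativize([500, 505, 510]))
```

The first two profiles produce the same relative plot, which is the point of the scaling. The last one shows the caveat mentioned above: a change of only 2% in absolute signal still spans 0% to 100%.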

Filter Parameters

To narrow down the number of genes displayed (or for output), you may filter the data using several types of parameters, each on its own tabbed page. Parameters from multiple pages may be used, with the understanding that there is an implicit Boolean "AND" in effect -- that is, only data which satisfies all of the filter requirements will be displayed. Thus, if you check "erythroid" and "liver", some erythroid genes will not be displayed because they are not also hepatic genes. To implement the filter, press the "Set Limits" button. To remove all filtering, press "Reset All".

The first page is source, and this uses tissue-specific information derived from Unigene and manual annotation. Non-murine sequences (such as microarray controls) can be excluded by checking the "murine only" check box. Checking the "All" sources box will un-check all other source boxes (conversely, if you check any specific tissue box, it will remove the check from the default "all tissues" check box).
Transcripts may be filtered by their function, assigned by GO database and manual review, using the second tab. Again, the "All" checkbox works counter to the individual check boxes.
The third tab holds the numeric criteria which people usually think about for filtering microarray data. If the top check box is marked, all genes which are considered "absent" on each of the 18 chips in the experiment will be excluded. This places some faith in the Affymetrix "presence/absence" algorithm, and may exclude relevant genes. The next two criteria are only activated when their check boxes are marked. The minimum signal threshold is a way of filtering out transcripts which have low signal levels throughout the time course. If a transcript's average signal level is above the threshold at any time point, it will be included in the filtered dataset. A value of around 100 seems to work for our dataset. The signal log ratio (SLR) versus zero is a filter parameter which attempts to identify only those genes whose expression at one or more time points differs from baseline by at least a minimum amount. The number is a log2 value, so an SLR of 1 is actually a fold-change of 2. The filter takes into account both positive and negative changes, so specifying an SLR of 1 means that only genes that are at least 2-fold induced or repressed versus time zero will be included in the analysis. To get an idea how these parameters narrow the dataset, see a discussion of filtering parameters.
Qualitative options allow the user to restrict the dataset according to subjective criteria. For instance, records may be marked manually using the "Mark Record" checkbox in the search area. This subset of records can then be isolated by applying a "Marked Records Only" filter. The resulting dataset can then be saved or printed. Data can also be filtered according to the "interest rating" assigned by the user. Thus, you can limit the data to the genes which are "interesting".
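The three numeric criteria, combined with the implicit "AND", can be sketched in Python as shown below. This is an illustration of the logic as described in the text, not the query used inside the database. The record layout, field names, probeset IDs and values are all hypothetical, and treating the SLR cutoff as inclusive (at least the given value) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Probeset:
    probeset_id: str      # hypothetical ID, not a real Affymetrix probeset
    signals: list         # average absolute signal at each time point
    slr_vs_zero: list     # signal log2 ratio versus time zero, per time point
    calls: list           # presence calls across arrays: "P", "A" or "M"

def passes(p, exclude_always_absent=True, min_signal=None, min_abs_slr=None):
    """Return True only if the record satisfies every active criterion."""
    if exclude_always_absent and all(c == "A" for c in p.calls):
        return False
    # Kept if the signal is above the threshold at any time point.
    if min_signal is not None and max(p.signals) <= min_signal:
        return False
    # Induction and repression both count, so compare absolute values.
    if min_abs_slr is not None and max(abs(x) for x in p.slr_vs_zero) < min_abs_slr:
        return False
    return True

records = [
    Probeset("example_1_at", [50, 400, 900], [3.0, 4.2], ["A", "P", "P"]),
    Probeset("example_2_at", [20, 30, 25], [0.6, 0.3], ["A", "A", "A"]),
    Probeset("example_3_at", [800, 790, 810], [0.0, 0.1], ["P", "P", "P"]),
    Probeset("example_4_at", [900, 400, 200], [-1.2, -2.2], ["P", "P", "P"]),
]

kept = [r.probeset_id for r in records if passes(r, min_signal=100, min_abs_slr=1.0)]
print(kept)
```

With a signal threshold of 100 and an SLR of 1 (a 2-fold change), the always-absent record and the unchanging record are dropped, while both the induced and the repressed records are kept.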

Search Parameters

Search parameter dialogue

Within the dataset (which may have been restricted through the filtering discussed above), records can be searched according to a number of parameters. The drop box determines which field will be searched, for example, gene symbol (the internationally agreed upon abbreviation of the gene's name). The search term may be entered in the lower box. If you provide a prefix, the search will return all hits that start with the prefix. For example, "hb" will return all hemoglobin genes (hba-a1, hbb-b1, hbb-b2, etc.). Wildcards may also be used, where * indicates any character or string, and ? indicates any single character. More advanced searches using regular expressions are not implemented, nor can all fields be searched simultaneously. The Mark Record checkbox will mark the record which is currently displayed. Search and Mark can be used together to generate a custom dataset manually. After marking the relevant records, filter the dataset to include only the marked records by using the qualitative filtering dialogue above.
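The prefix and wildcard rules can be mimicked in Python as a rough guide to what a given search term will match. This is a sketch of the behavior described above, not the search code in the database. The function names and the list of gene symbols are illustrative, and the case-insensitive matching is an assumption.

```python
import re

def build_pattern(term):
    """Translate a search term into a regular expression.

    '*' matches any string, '?' matches any single character, and a term
    without wildcards is treated as a prefix.
    """
    if "*" not in term and "?" not in term:
        term += "*"
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Assumption: matching ignores upper and lower case.
    return re.compile("".join(parts), re.IGNORECASE)

def search(symbols, term):
    """Return the symbols that match the whole pattern."""
    pattern = build_pattern(term)
    return [s for s in symbols if pattern.fullmatch(s)]

symbols = ["Hba-a1", "Hbb-b1", "Hbb-b2", "Gata1", "Kit"]
print(search(symbols, "hb"))      # prefix search
print(search(symbols, "hbb-b?"))  # single-character wildcard
print(search(symbols, "*a1"))     # leading wildcard
```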

Output Options

Output options dialogue. Buttons for outputting a file or printing a report

The output options operate on the current dataset, so if it has been filtered, only a subset of the entire database will be written out to a file or printed. That's good, because outputting large files or printing large reports takes a while, and the resulting files or print jobs can be very large if the data are not somehow restricted. There are two options for outputting a file: either writing it in Microsoft Excel format or as HTML. Often, it is easier to manipulate large chunks of information in a spreadsheet. Additionally, two types of reports (shown below) can be generated from the database: a detailed report and a summary. The detailed report provides essentially the same information as the database screen for each probeset, and probesets are printed one to a page. The summary view prints some identifying information, the absolute signal values for each time point, and reproduces the relative expression plot for each transcript. Transcripts printed as a summary are roughly organized by transcription profile shape, but no formal clustering methodology should be assumed. Multiple transcripts are printed on each page of the summary report.


Please direct all comments to: John Welch, M.D., Ph.D.
Last modification: August 16, 2003