Welcome to the Integration Site Pipeline and Database, which is universally referred to (unfortunately) as INSIPID!
The purpose of this web-based tool is to house sequences of newly inserted elements in vertebrate genomes and allow users to investigate their locations. In a typical experiment, a researcher might collect a set of sequences of HIV integration sites in a human T-cell line. INSIPID can house the sequences together with associated annotation, then allow experimenters to call up the sequences and information about them. Comparison to random integration sites or other integration site data sets allows targeting biases to be detected and analyzed.
Let's work through an example. On the INSIPID main page (http://microb215.med.upenn.edu/insipid/), the upper left corner holds a set of tabs indicating the genome hosting the integration sites ("Hs" is "Homo sapiens", etc.). Click on the Hs tab. The page now lists the integration site data sets deposited by various users, in alphabetical order.
Navigate down to the data set marked "Schroder-HIV-SupT1-inVivo". These are integration sites from the first large-scale study of integration targeting in the human genome (A. W. Schroder, P. Shinn, H. Chen, C. Berry, J. R. Ecker, and F. D. Bushman. (2002) HIV-1 Integration in the Human Genome Favors Active Genes and Local Hotspots. Cell 110, 521-529). Click on the name of the data set.
Further down the left column is the "Submit" button. Hit "Submit".
After a pause, you will see the number of unique integration sites in the data set that passed quality control, in this case 541. Below that is a list of each integration site sequence. To the right are features of that sequence: ChrNum is the number of the human chromosome that hosted that integration site, Ort is the orientation of the sequence relative to the genomic sequence, and IntSite is the nucleotide in the human genome adjacent to the site. Further to the right are columns for various catalogs of human genes (Acembly, Genescan, Known, Refseq, Unigene, Ensembl). For each integration site that lies within a gene, the gene designation appears in that site's row. To the right of those columns is an indication of whether the integration site is in a CpG island or in a repetitive element as called by Repeat Masker.
Note that users can configure the columns to contain different types of annotation. Click on the “Change column prefs” button and you will see a list of the types of annotation currently available. After choosing the preferred columns, click on the “Submit” button at the bottom of the page, then click on the “Return to viewer” link at the top. If you were viewing a data set before, you may need to select the data set again and hit “Submit” to view the new columns.
These spreadsheets can be exported to Excel by clicking the "xls" button at the upper right, then pressing Submit. INSIPID will then allow you to save the file to a location of your choice.
Asking a Question in INSIPID
Let's use INSIPID to investigate the question of whether HIV favors integration in transcription units defined by Refseq genes. We just determined that there are 541 unique integration sites in the Schroder in vivo data set. We will determine how many of those are in Refseq genes, then compare them to a control that mimics random integration in the human genome.
Go to INSIPID and click on "Schroder-HIV-SupT1-inVivo". Now in the central column at the top, under "Has Refseq?", select "yes" and hit "Submit". INSIPID reports that 368 integration sites in the data set are within Refseq genes. Below are listed the specific sites and specific genes. Thus of the 541 integration sites, 368 are in Refseq genes and 173 are outside Refseq genes.
Is this a significant bias in favor of integration in Refseq genes by HIV? To answer this, we need to compare the HIV integration site data to random data. There are several ways to do this. The one used in the Schroder et al. paper involved carrying out HIV integration in a test tube, using purified human DNA as the integration target. Integration sites were then cloned and sequenced from this population. This allowed HIV integration targeting to be assessed in the absence of chromosomal proteins, nuclear architecture, etc., mimicking random integration in the human genome.
To analyze the in vitro data set, click on "Schroder-HIV-SupT1-inVitro". Reset the "Has Refseq?" button back to "don't care". Now hit "Submit". We learn that the data set has 125 unique sites. Now press "yes" under "Has Refseq?" and hit "Submit" again. We learn that 54 clones are within Refseq genes, which means that 71 are outside Refseq genes.
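The counts gathered so far can be arranged as "in" versus "out" for each data set. The short Python sketch below is only an illustration of that bookkeeping: the numbers are the ones reported in this walkthrough, while the variable names and labels are made up for the example and are not part of INSIPID.

```python
# Counts reported by INSIPID in this walkthrough.
# The dictionary layout and labels are illustrative, not an INSIPID format.
counts = {
    "Schroder-HIV-SupT1-inVivo": {"total": 541, "in_refseq": 368},
    "Schroder-HIV-SupT1-inVitro": {"total": 125, "in_refseq": 54},
}

table = []  # one (in, out) row per data set
for name, c in counts.items():
    inside = c["in_refseq"]
    outside = c["total"] - inside  # sites not within any Refseq gene
    table.append((inside, outside))
    fraction = inside / c["total"]
    print(f"{name}: {inside} in, {outside} out ({fraction:.1%} in Refseq genes)")
```

Roughly 68% of the in vivo sites fall in Refseq genes, compared with about 43% of the in vitro sites.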
Is this a significant difference? We can carry out a simple statistical test in any statistical software package. I usually use GraphPad Prism. Enter the four values (368 and 173 for the in vivo data set, 54 and 71 for the in vitro data set) into a spreadsheet to make a 2x2 contingency table, then analyze using Fisher's exact test. We find that the P value is <0.0001.
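For readers without a statistics package at hand, the same test can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the method used by Prism or by INSIPID itself: the function name fisher_exact_two_sided is hypothetical, and the two-sided P value is computed by the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Illustrative helper (not part of INSIPID). Sums the hypergeometric
    probabilities of every table, with the same row and column totals,
    that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Rows: in vivo, in vitro. Columns: in Refseq genes, outside Refseq genes.
p_value = fisher_exact_two_sided(368, 173, 54, 71)
odds_ratio = (368 * 71) / (173 * 54)
print(f"P = {p_value:.3g}, odds ratio = {odds_ratio:.2f}")
```

The resulting P value is well below 0.0001, in agreement with the result quoted above.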
Thus HIV integration is highly favored within Refseq transcription units!
This same set of tools can be used to ask many questions about biases in integration target site selection. Just choose an integration site data set and an appropriate random control, then analyze "in" versus "out" as a 2x2 contingency table. For many data sets, computationally generated matched random controls (labeled "MRC") are available.
Enjoy!
On Macs, the latest version of INSIPID runs on Firefox or Safari beta, but not early versions of Safari. On PCs, INSIPID runs well on recent versions of Internet Explorer.