RAP Probe Generation
This page demonstrates how to generate probe sequences for RNA Antisense Purification (RAP) using the function rap_probes()
. The function divides a fasta file into fragments of a user-specified length, converts them to their reverse complement, and adds any additional sequences (either an adapter sequence or 5’-biotin). Additionally, the function has the option of filtering probes for multimappers using BLAT, and for repetitive elements using
DFAM.
BLAT filtering in rap_probes()
ignores multimappers found within the targeted gene. lncRNAs such as Xist and Kcnq1ot1 contain internally repetitive sequences that are found nowhere else in the genome. BLAT will nonetheless identify probes to these regions as multimappers, but all the homologous regions will lie within the intended target. When this situation occurs, rap_probes()
ignores the multimapping and keeps the probe.
That said, lncRNA research is plagued by studies that fail to note known repetitive elements within annotated lncRNAs. For example, say a scientist performs RAP-DNA on a lncRNA containing a SINE element and does not remove SINE sequences from their probes. Genomic alignment will incorrectly indicate that the lncRNA localizes to SINEs genome-wide. In fact, this result is due to antisense purification of nascent SINE RNAs, which are present in the majority of introns. DFAM filtering reduces the likelihood of this error by removing probes to known selfish elements in a genome.
In general, RAP probes should be 60-120nt long. For protocols using heat elution, shorter probes (60-90nt) are preferable as longer probes reduce the efficiency of melting the RNA-DNA hybrid. If eluting with benzonase or another nuclease, longer probes are acceptable.
I typically recommend using at least 50% coverage of the RNA to ensure efficient capture. For protocols where only a specific part of the RNA is of interest (e.g. RAP-MS for the A-repeat of Xist), it may be possible to tile that specific locus. Be advised, however, that we have not systematically tested this and that it is likely sensitive to the extent of RNA fragmentation. Insufficiently fragmented RNA may be difficult to target with probes to a specific subregion.
Following This Demo
This notebook and the xist.fasta
file are available on Github.
Installation
rap_probes()
is distributed as part of the probeutils
package (at the moment it is the only function, but this will be expanded in time). Installation is simple using pip:
pip install probeutils
Input file
The FASTA interpreter in rap_probes()
accepts either a file containing only DNA bases and newlines, or a file with one or more FASTA sequences separated by headers beginning with ‘>’. If the file contains multiple sequences, rap_probes()
will only generate probes for the first entry.
For this example, we will use human Xist downloaded from NCBI:
[1]:
with open('xist.fasta', 'r') as x:
xist = x.readlines()
The first line of the file contains the gene identity information:
[2]:
xist[0]
[2]:
'>NR_001564.2 Homo sapiens X inactive specific transcript (XIST), long non-coding RNA\n'
The final line is a newline character:
[3]:
xist[-1]
[3]:
'\n'
And all the lines in between are sequences separated by newlines:
[4]:
xist[1:10]
[4]:
['CCTTCAGTTCTTAAAGCGCTGCAATTCGCTGCTGCAGCCATATTTCTTACTCTCTCGGGGCTGGAAGCTT\n',
'CCTGACTGAAGATCTCTCTGCACTTGGGGTTCTTTCTAGAACATTTTCTAGTCCCCCAACACCCTTTATG\n',
'GCGTATTTCTTTAAAAAAATCACCTAAATTCCATAAAATATTTTTTTAAATTCTATACTTTCTCCTAGTG\n',
'TCTTCTTGACACGTCCTCCATATTTTTTTAAAGAAAGTATTTGGAATATTTTGAGGCAATTTTTAATATT\n',
'TAAGGAATTTTTCTTTGGAATCATTTTTGGTTGACATCTCTGTTTTTTGTGGATCAGTTTTTTACTCTTC\n',
'CACTCTCTTTTCTATATTTTGCCCATCGGGGCTGCGGATACCTGGTTTTATTATTTTTTCTTTGCCCAAC\n',
'GGGGCCGTGGATACCTGCCTTTTAATTCTTTTTTATTCGCCCATCGGGGCCGCGGATACCTGCTTTTTAT\n',
'TTTTTTTTCCTTAGCCCATCGGGGTATCGGATACCTGCTGATTCCCTTCCCCTCTGAACCCCCAACACTC\n',
'TGGCCCATCGGGGTGACGGATATCTGCTTTTTAAAAATTTTCTTTTTTTGGCCCATCGGGGCTTCGGATA\n']
Running rap_probes()
After installing the probeutils
package, rap_probes()
should be available for import.
[5]:
# Required only for demonstration
import os
import pandas as pd
# Probe generation
from probe_utils import rap_probes
rap_probes()
currently accepts eight parameters. The following is copied from the function description:
fasta : str
Path to a fasta file containing the sequence to
generate probes against
gene : str
The name of the target gene, used to name probes
and the output file
adaptseq : str
Any nucleotides that should be added to the 5'-end
of each probe. These are used for ligating probes
to beads or barcodes. By default, the value is set
to the first SPRITE barcode. If no adapter is required,
set this parameter to ''. Default 'CAAGTCA'
probe_length : int
The total length of the probe in nucleotides. If
an adaptor is used, this length includes the length
of the adapter. Default 90
biotin : Bool
Whether to add a 5'-biotin to the probes. Formatted
for ordering from Integrated DNA Technologies (IDT).
Default False
blat : Bool
Whether to filter probes for multiple genome matches
using UCSC BLAT. If True, the genome assembly name
must be supplied to **kwargs. Default True
dfam : Bool
Whether to filter probes for transposable elements and
tandem repeats using the Institute of Systems Biology's
Dfam database. If True, the species name must be supplied
to **kwargs. Default True
**kwargs : dictionary
genome : str
Used for BLAT filtering. Short assembly name for the
species genome as listed in BLAT, e.g. 'hg38,' 'mm39,'
or 'dm6'
tolerance : int
Used for BLAT filtering. Number of acceptable matches
to other genomic loci. Default 25
species : str
DFAM species to check repeats, e.g. "Homo sapiens",
"Mus musculus", or "Drosophila melanogaster"
The function exports a five files into a directory called [gene] + _rapProbesOutput/
. It also returns a Pandas DataFrame with the final probe sequences and names. The full list of files outputted is in the function description:
output : a Pandas DataFrame
A dataframe containing the final probes after filtering
steps. Identical to the Probes.csv file
rapProbesLog.out : a text file
A text file containing a log of steps taken by the
rap_probes function
[gene]_[probe_length]ntProbes.csv : a csv file
A csv file containing the final probes. Identical
to the Pandas Dataframe ouput
blatFailedProbes.csv : a csv file
If performing BLAT filtering, a csv file containing BLAT
results for probes that did not pass filters
blatPassedProbes.csv : a csv file
If performing BLAT filtering, a csv file containing BLAT
results for probes that passed filters
dfamFailedProbes.csv : a csv file
If performing Dfam filtering, a csv file containing Dfam
results for probes that did not pass filters
In this example, we will generate probes for human Xist. Given that Xist is a unique genomic locus, we want to use both BLAT and DFAM to filter out non-specific probes. If targeting a repetitive element (e.g. LINE1) or a multicopy gene (e.g. U1 snRNA), these filters should be turned off.
The genome BLAT queries is specified by the keyword argument ‘genome’, and the species DFAM searches is determined by the keyword argument ‘species’. Refer to the current builds of the two databases to determine which genomes and species are supported.
[6]:
kwargs = {'genome':'hg38',
'species':'Homo sapiens'}
With all of this understood we are now ready to run the script. Progress messages will appear if running BLAT or DFAM, and a final message (‘Probe generation complete’) will inform you that the script ran successfully.
[7]:
df = rap_probes(fasta = 'xist.fasta',
gene = 'HuXist',
adaptseq = 'CAAGTCA',
probe_length = 90,
biotin = False,
blat = True,
dfam = True,
**kwargs)
Starting BLAT
100%|███████████████████████████████████████████| 10/10 [00:45<00:00, 4.52s/it]
BLAT Done
Starting Dfam
Search submitted successfully.
DFAM Done
Probe generation complete
We can now see that all the output files have been sent to HuXist_rapProbesOut/
[8]:
os.listdir('HuXist_rapProbesOutput/')
[8]:
['HuXist_90ntProbes.csv',
'rapProbesLog.out',
'blatPassedProbes.csv',
'dfamFailedProbes.csv',
'.ipynb_checkpoints',
'blatFailedProbes.csv']
We can check how many probes were originally generated, and how many BLAT and DFAM filtered.
[9]:
with open('HuXist_rapProbesOutput/rapProbesLog.out','r') as f:
print(f.read())
Probe Design Log for HuXist
Original probes generated: 233
BLAT Results
Identified locus: chrX:73820651-73852753 (-)
Genome Match: 100.0%
Probes remaining after BLAT: 137
Dfam Results
Search submitted successfully
Dfam search time: 19 seconds
Probes remaining after Dfam: 136
From this we can see that BLAT correctly identified the Xist locus and that the input file had a 100% match to the hg38 genome. If this match had been lower than 100%, the user would have been asked whether to abort the program or continue.
Around 60% of probes remain, which should be acceptable for most RAP applications.
If we are curious about why probes failed, we can look at the probes that failed BLAT and DFAM filtering. In this case, the probe originally named 218 was particularly problematic:
[10]:
blat = pd.read_csv('HuXist_rapProbesOutput/blatFailedProbes.csv')
blat[blat['qName'] == 218]
[10]:
matches | qName | tName | tStart | tEnd | qStarts | tStarts | |
---|---|---|---|---|---|---|---|
983 | 83 | 218 | chrX | 73821753 | 73821836 | 0 | 73821753 |
984 | 40 | 218 | chr2 | 108652932 | 108653003 | 35,42,57,77 | 108652932,108652940,108652952,108652997 |
985 | 29 | 218 | chr7 | 127736479 | 127736518 | 33,50 | 127736479,127736505 |
986 | 29 | 218 | chr16 | 14040243 | 14040279 | 26,32,49 | 14040243,14040254,14040273 |
987 | 28 | 218 | chr6 | 152466529 | 152466566 | 22,27,35 | 152466529,152466533,152466549 |
988 | 27 | 218 | chr12 | 80656242 | 80656277 | 2,12 | 80656242,80656259 |
989 | 26 | 218 | chr2 | 110474587 | 110474900 | 0,6 | 110474587,110474880 |
990 | 26 | 218 | chr4 | 41387480 | 41387513 | 32,54 | 41387480,41387505 |
991 | 24 | 218 | chr4 | 22061022 | 22061057 | 25,32 | 22061022,22061040 |
992 | 24 | 218 | chr2 | 181947660 | 181947690 | 59,75 | 181947660,181947682 |
993 | 24 | 218 | chr11 | 87347049 | 87347074 | 43 | 87347049 |
994 | 23 | 218 | chrX | 11420280 | 11420310 | 21,35 | 11420280,11420301 |
995 | 22 | 218 | chr6 | 18851074 | 18851096 | 28 | 18851074 |
996 | 22 | 218 | chrUn_GL000195v1 | 96928 | 96950 | 33 | 96928 |
997 | 22 | 218 | chr21 | 6277553 | 6277575 | 33 | 6277553 |
998 | 21 | 218 | chr7 | 123730795 | 123730816 | 33 | 123730795 |
999 | 20 | 218 | chr3 | 157490812 | 157490832 | 43 | 157490812 |
1000 | 20 | 218 | chr13 | 64330699 | 64330719 | 32 | 64330699 |
1001 | 20 | 218 | chr1 | 107901625 | 107901645 | 21 | 107901625 |
1002 | 25 | 218 | chr14 | 27657732 | 27657762 | 20 | 27657732 |
DFAM only identified a single probe containing a repetitive element, the hAT transposon MER58C:
[11]:
pd.read_csv('HuXist_rapProbesOutput/dfamFailedProbes.csv')
[11]:
probe | query | type | e_value | |
---|---|---|---|---|
0 | GGTAAGCTATGAACAGCAGGCCAAATCCAATTGGCTCAAAAACTAA... | MER58C | DNA | 0.000034 |
The final probes are located in the file ‘HuXist_90ntProbes.csv’, but the output of rap_probes()
let’s you explore the file directly without importing it:
[12]:
df.head()
[12]:
Name | Sequence | |
---|---|---|
0 | HuXist_0 | CAAGTCAATCTTCAGTCAGGAAGCTTCCAGCCCCGAGAGAGTAAGA... |
1 | HuXist_1 | CAAGTCATAGGTGATTTTTTTAAAGAAATACGCCATAAAGGGTGTT... |
2 | HuXist_2 | CAAGTCAAATCTGAACACGCCCTTAGCTTAACTGCAGAGTCATTCT... |
3 | HuXist_3 | CAAGTCAAAAGGGAGTCCATGAGAAGGTGCCCTTATCTAGTACACA... |
4 | HuXist_4 | CAAGTCATACTGCAAATGGAGGGTGAGAAGGTAGAACTTTGTTTAA... |
From here, it is easy to convert the .csv
file to a format that can be ordered on IDT plates or an array.
Computing Environment
[13]:
%load_ext watermark
%watermark -v -p probe_utils,jupyterlab,os,pandas
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.12.0
probe_utils: 1.0.3
jupyterlab : 3.6.3
os : unknown
pandas : 2.0.3