RAP Probe Generation

This page demonstrates how to generate probe sequences for RNA Antisense Purification (RAP) using the function rap_probes(). The function divides a fasta file into fragments of a user-specified length, converts them to their reverse complement, and adds any additional sequences (either an adapter sequence or 5’-biotin). Additionally, the function has the option of filtering probes for multimappers using BLAT, and for repetitive elements using DFAM.

BLAT filtering in rap_probes() ignores multimappers found within the targeted gene. lncRNAs such as Xist and Kcnq1ot1 contain internally repetitive sequences that are found nowhere else in the genome. BLAT will nonetheless identify probes to these regions as multimappers, but all the homologous regions will lie within the intended target. When this situation occurs, rap_probes() ignores the multimapping and keeps the probe.

That said, lncRNA research is plagued by studies that fail to note known repetitive elements within annotated lncRNAs. For example, say a scientist performs RAP-DNA on a lncRNA containing a SINE element and does not remove SINE sequences from their probes. Genomic alignment will incorrectly indicate that the lncRNA localizes to SINEs genome-wide. In fact, this result is due to antisense purification of nascent SINE RNAs, which are present in the majority of introns. DFAM filtering reduces the likelihood of this error by removing probes to known selfish elements in a genome.

In general, RAP probes should be 60-120nt long. For protocols using heat elution, shorter probes (60-90nt) are preferable as longer probes reduce the efficiency of melting the RNA-DNA hybrid. If eluting with benzonase or another nuclease, longer probes are acceptable.

I typically recommend using at least 50% coverage of the RNA to ensure efficient capture. For protocols where only a specific part of the RNA is of interest (e.g. RAP-MS for the A-repeat of Xist), it may be possible to tile that specific locus. Be advised, however, that we have not systematically tested this and that it is likely sensitive to the extent of RNA fragmentation. Insufficiently fragmented RNA may be difficult to target with probes to a specific subregion.

Following This Demo

This notebook and the xist.fasta file are available on Github.

Installation

rap_probes() is distributed as part of the probeutils package (at the moment it is the only function, but this will be expanded in time). Installation is simple using pip:

pip install probeutils

Input file

The FASTA interpreter in rap_probes() accepts either a file containing only DNA bases and newlines, or a file with one or more FASTA sequences separated by headers beginning with ‘>’. If the file contains multiple sequences, rap_probes() will only generate probes for the first entry.

For this example, we will use human Xist downloaded from NCBI:

[1]:
with open('xist.fasta', 'r') as x:
    xist = x.readlines()

The first line of the file contains the gene identity information:

[2]:
xist[0]
[2]:
'>NR_001564.2 Homo sapiens X inactive specific transcript (XIST), long non-coding RNA\n'

The final line is a newline character:

[3]:
xist[-1]
[3]:
'\n'

And all the lines in between are sequences separated by newlines:

[4]:
xist[1:10]
[4]:
['CCTTCAGTTCTTAAAGCGCTGCAATTCGCTGCTGCAGCCATATTTCTTACTCTCTCGGGGCTGGAAGCTT\n',
 'CCTGACTGAAGATCTCTCTGCACTTGGGGTTCTTTCTAGAACATTTTCTAGTCCCCCAACACCCTTTATG\n',
 'GCGTATTTCTTTAAAAAAATCACCTAAATTCCATAAAATATTTTTTTAAATTCTATACTTTCTCCTAGTG\n',
 'TCTTCTTGACACGTCCTCCATATTTTTTTAAAGAAAGTATTTGGAATATTTTGAGGCAATTTTTAATATT\n',
 'TAAGGAATTTTTCTTTGGAATCATTTTTGGTTGACATCTCTGTTTTTTGTGGATCAGTTTTTTACTCTTC\n',
 'CACTCTCTTTTCTATATTTTGCCCATCGGGGCTGCGGATACCTGGTTTTATTATTTTTTCTTTGCCCAAC\n',
 'GGGGCCGTGGATACCTGCCTTTTAATTCTTTTTTATTCGCCCATCGGGGCCGCGGATACCTGCTTTTTAT\n',
 'TTTTTTTTCCTTAGCCCATCGGGGTATCGGATACCTGCTGATTCCCTTCCCCTCTGAACCCCCAACACTC\n',
 'TGGCCCATCGGGGTGACGGATATCTGCTTTTTAAAAATTTTCTTTTTTTGGCCCATCGGGGCTTCGGATA\n']

Running rap_probes()

After installing the probeutils package, rap_probes() should be available for import.

[5]:
# Required only for demonstration
import os
import pandas as pd

# Probe generation
from probe_utils import rap_probes

rap_probes() currently accepts eight parameters. The following is copied from the function description:

fasta : str
    Path to a fasta file containing the sequence to
    generate probes against

gene : str
    The name of the target gene, used to name probes
    and the output file

adaptseq : str
    Any nucleotides that should be added to the 5'-end
    of each probe. These are used for ligating probes
    to beads or barcodes. By default, the value is set
    to the first SPRITE barcode. If no adapter is required,
    set this parameter to ''. Default 'CAAGTCA'

probe_length : int
    The total length of the probe in nucleotides. If
    an adaptor is used, this length includes the length
    of the adapter. Default 90

biotin : Bool
    Whether to add a 5'-biotin to the probes. Formatted
    for ordering from Integrated DNA Technologies (IDT).
    Default False

blat : Bool
    Whether to filter probes for multiple genome matches
    using UCSC BLAT. If True, the genome assembly name
    must be supplied to **kwargs. Default True

dfam : Bool
    Whether to filter probes for transposable elements and
    tandem repeats using the Institute of Systems Biology's
    Dfam database. If True, the species name must be supplied
    to **kwargs. Default True

**kwargs : dictionary

    genome : str
        Used for BLAT filtering. Short assembly name for the
        species genome as listed in BLAT, e.g. 'hg38,' 'mm39,'
        or 'dm6'

    tolerance : int
        Used for BLAT filtering. Number of acceptable matches
        to other genomic loci. Default 25

    species : str
        DFAM species to check repeats, e.g. "Homo sapiens",
        "Mus musculus", or "Drosophila melanogaster"

The function exports a five files into a directory called [gene] + _rapProbesOutput/. It also returns a Pandas DataFrame with the final probe sequences and names. The full list of files outputted is in the function description:

output : a Pandas DataFrame
    A dataframe containing the final probes after filtering
    steps. Identical to the Probes.csv file

rapProbesLog.out : a text file
    A text file containing a log of steps taken by the
    rap_probes function

[gene]_[probe_length]ntProbes.csv : a csv file
    A csv file containing the final probes. Identical
    to the Pandas Dataframe ouput

blatFailedProbes.csv : a csv file
    If performing BLAT filtering, a csv file containing BLAT
    results for probes that did not pass filters

blatPassedProbes.csv : a csv file
    If performing BLAT filtering, a csv file containing BLAT
    results for probes that passed filters

dfamFailedProbes.csv : a csv file
    If performing Dfam filtering, a csv file containing Dfam
    results for probes that did not pass filters

In this example, we will generate probes for human Xist. Given that Xist is a unique genomic locus, we want to use both BLAT and DFAM to filter out non-specific probes. If targeting a repetitive element (e.g. LINE1) or a multicopy gene (e.g. U1 snRNA), these filters should be turned off.

The genome BLAT queries is specified by the keyword argument ‘genome’, and the species DFAM searches is determined by the keyword argument ‘species’. Refer to the current builds of the two databases to determine which genomes and species are supported.

[6]:
kwargs = {'genome':'hg38',
         'species':'Homo sapiens'}

With all of this understood we are now ready to run the script. Progress messages will appear if running BLAT or DFAM, and a final message (‘Probe generation complete’) will inform you that the script ran successfully.

[7]:
df = rap_probes(fasta = 'xist.fasta',
               gene = 'HuXist',
               adaptseq = 'CAAGTCA',
               probe_length = 90,
               biotin = False,
               blat = True,
               dfam = True,
               **kwargs)
Starting BLAT
100%|███████████████████████████████████████████| 10/10 [00:45<00:00,  4.52s/it]
BLAT Done
Starting Dfam
Search submitted successfully.
DFAM Done
Probe generation complete

We can now see that all the output files have been sent to HuXist_rapProbesOut/

[8]:
os.listdir('HuXist_rapProbesOutput/')
[8]:
['HuXist_90ntProbes.csv',
 'rapProbesLog.out',
 'blatPassedProbes.csv',
 'dfamFailedProbes.csv',
 '.ipynb_checkpoints',
 'blatFailedProbes.csv']

We can check how many probes were originally generated, and how many BLAT and DFAM filtered.

[9]:
with open('HuXist_rapProbesOutput/rapProbesLog.out','r') as f:
    print(f.read())
Probe Design Log for HuXist
Original probes generated: 233

BLAT Results
Identified locus: chrX:73820651-73852753 (-)
Genome Match: 100.0%
Probes remaining after BLAT: 137

Dfam Results
Search submitted successfully
Dfam search time: 19 seconds
Probes remaining after Dfam: 136

From this we can see that BLAT correctly identified the Xist locus and that the input file had a 100% match to the hg38 genome. If this match had been lower than 100%, the user would have been asked whether to abort the program or continue.

Around 60% of probes remain, which should be acceptable for most RAP applications.

If we are curious about why probes failed, we can look at the probes that failed BLAT and DFAM filtering. In this case, the probe originally named 218 was particularly problematic:

[10]:
blat = pd.read_csv('HuXist_rapProbesOutput/blatFailedProbes.csv')

blat[blat['qName'] == 218]
[10]:
matches qName tName tStart tEnd qStarts tStarts
983 83 218 chrX 73821753 73821836 0 73821753
984 40 218 chr2 108652932 108653003 35,42,57,77 108652932,108652940,108652952,108652997
985 29 218 chr7 127736479 127736518 33,50 127736479,127736505
986 29 218 chr16 14040243 14040279 26,32,49 14040243,14040254,14040273
987 28 218 chr6 152466529 152466566 22,27,35 152466529,152466533,152466549
988 27 218 chr12 80656242 80656277 2,12 80656242,80656259
989 26 218 chr2 110474587 110474900 0,6 110474587,110474880
990 26 218 chr4 41387480 41387513 32,54 41387480,41387505
991 24 218 chr4 22061022 22061057 25,32 22061022,22061040
992 24 218 chr2 181947660 181947690 59,75 181947660,181947682
993 24 218 chr11 87347049 87347074 43 87347049
994 23 218 chrX 11420280 11420310 21,35 11420280,11420301
995 22 218 chr6 18851074 18851096 28 18851074
996 22 218 chrUn_GL000195v1 96928 96950 33 96928
997 22 218 chr21 6277553 6277575 33 6277553
998 21 218 chr7 123730795 123730816 33 123730795
999 20 218 chr3 157490812 157490832 43 157490812
1000 20 218 chr13 64330699 64330719 32 64330699
1001 20 218 chr1 107901625 107901645 21 107901625
1002 25 218 chr14 27657732 27657762 20 27657732

DFAM only identified a single probe containing a repetitive element, the hAT transposon MER58C:

[11]:
pd.read_csv('HuXist_rapProbesOutput/dfamFailedProbes.csv')
[11]:
probe query type e_value
0 GGTAAGCTATGAACAGCAGGCCAAATCCAATTGGCTCAAAAACTAA... MER58C DNA 0.000034

The final probes are located in the file ‘HuXist_90ntProbes.csv’, but the output of rap_probes() let’s you explore the file directly without importing it:

[12]:
df.head()
[12]:
Name Sequence
0 HuXist_0 CAAGTCAATCTTCAGTCAGGAAGCTTCCAGCCCCGAGAGAGTAAGA...
1 HuXist_1 CAAGTCATAGGTGATTTTTTTAAAGAAATACGCCATAAAGGGTGTT...
2 HuXist_2 CAAGTCAAATCTGAACACGCCCTTAGCTTAACTGCAGAGTCATTCT...
3 HuXist_3 CAAGTCAAAAGGGAGTCCATGAGAAGGTGCCCTTATCTAGTACACA...
4 HuXist_4 CAAGTCATACTGCAAATGGAGGGTGAGAAGGTAGAACTTTGTTTAA...

From here, it is easy to convert the .csv file to a format that can be ordered on IDT plates or an array.

Computing Environment

[13]:
%load_ext watermark
%watermark -v -p probe_utils,jupyterlab,os,pandas
Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.12.0

probe_utils: 1.0.3
jupyterlab : 3.6.3
os         : unknown
pandas     : 2.0.3