RAP Probe Generation

This page demonstrates how to generate probe sequences for RNA Antisense Purification (RAP) using the function rap_probes(). The function divides the sequence in a FASTA file into fragments of a user-specified length, converts them to their reverse complement, and adds any additional sequences (either an adapter sequence or 5’-biotin). The function can also filter probes for multimappers using BLAT and for repetitive elements using DFAM.
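The tiling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the helper names reverse_complement and tile_probes are hypothetical, the handling of a short final fragment is an assumption, and the /5Biosg/ prefix (IDT's notation for a 5'-biotin) is assumed to be the format intended. As stated in the probe_length description further down this page, the adapter counts toward the total probe length.

```python
# Hypothetical helpers for illustration; not the probeutils source code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_probes(target, probe_length=90, adaptseq="CAAGTCA", biotin=False):
    """Tile a target into non-overlapping antisense probes.

    probe_length includes the adapter, so each probe covers
    probe_length - len(adaptseq) nucleotides of the target.
    """
    window = probe_length - len(adaptseq)
    if window <= 0:
        raise ValueError("probe_length must exceed the adapter length")
    probes = []
    for start in range(0, len(target), window):
        fragment = target[start:start + window]
        # Assumption: a short final fragment is kept as a shorter probe.
        probe = adaptseq + reverse_complement(fragment)
        if biotin:
            # Assumption: IDT's 5'-biotin code is prepended to the sequence.
            probe = "/5Biosg/" + probe
        probes.append(probe)
    return probes


# First two sequence lines of the Xist FASTA used in this demo.
xist_start = (
    "CCTTCAGTTCTTAAAGCGCTGCAATTCGCTGCTGCAGCCATATTTCTTACTCTCTCGGGGCTGGAAGCTT"
    "CCTGACTGAAGATCTCTCTGCACTTGGGGTTCTTTCTAGAACATTTTCTAGTCCCCCAACACCCTTTATG"
)
probes = tile_probes(xist_start)
print(probes[0])
```

With the default 7-nt adapter and a 90-nt probe, each probe covers 83 nt of target. Applied to the first two sequence lines of the Xist FASTA, the sketch reproduces the start of probe HuXist_0 shown in the final output of this demo.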

BLAT filtering in rap_probes() ignores multimappers found within the targeted gene. lncRNAs such as Xist and Kcnq1ot1 contain internally repetitive sequences that are found nowhere else in the genome. BLAT will nonetheless identify probes to these regions as multimappers, but all the homologous regions will lie within the intended target. When this situation occurs, rap_probes() ignores the multimapping and keeps the probe.
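The idea of ignoring hits inside the target can be illustrated with a simple coordinate check. The sketch below is an assumption about how such a filter could work, not the code used by rap_probes(): the Locus and Hit types and the function names are hypothetical, and requiring a hit to lie entirely within the locus (rather than merely overlap it) is a choice made for this example. The coordinates are taken from the BLAT output shown later in this demo.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int


class Hit(NamedTuple):
    chrom: str
    start: int
    end: int
    matches: int


def within_locus(hit, locus):
    """True if the hit lies entirely inside the target locus."""
    return (
        hit.chrom == locus.chrom
        and hit.start >= locus.start
        and hit.end <= locus.end
    )


def off_target_hits(hits, locus):
    """Keep only the hits that fall outside the intended target."""
    return [hit for hit in hits if not within_locus(hit, locus)]


# Xist locus and three hits for one probe, copied from the demo output.
xist = Locus("chrX", 73820651, 73852753)
hits = [
    Hit("chrX", 73821753, 73821836, 83),    # inside Xist: ignored
    Hit("chr2", 108652932, 108653003, 40),  # different chromosome
    Hit("chrX", 11420280, 11420310, 23),    # same chromosome, outside Xist
]
for hit in off_target_hits(hits, xist):
    print(hit)
```

Only the second and third hits are reported as off-target; the first is a match to the intended locus and would not count against the probe.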

That said, lncRNA research is plagued by studies that fail to note known repetitive elements within annotated lncRNAs. For example, say a scientist performs RAP-DNA on a lncRNA containing a SINE element and does not remove SINE sequences from their probes. Genomic alignment will incorrectly indicate that the lncRNA localizes to SINEs genome-wide. In fact, this result is due to antisense purification of nascent SINE RNAs, which are present in the majority of introns. DFAM filtering reduces the likelihood of this error by removing probes to known selfish elements in a genome.

In general, RAP probes should be 60-120nt long. For protocols using heat elution, shorter probes (60-90nt) are preferable as longer probes reduce the efficiency of melting the RNA-DNA hybrid. If eluting with benzonase or another nuclease, longer probes are acceptable.

I typically recommend using at least 50% coverage of the RNA to ensure efficient capture. For protocols where only a specific part of the RNA is of interest (e.g. RAP-MS for the A-repeat of Xist), it may be possible to tile that specific locus. Be advised, however, that we have not systematically tested this and that it is likely sensitive to the extent of RNA fragmentation. Insufficiently fragmented RNA may be difficult to target with probes to a specific subregion.

Following This Demo

This notebook and the xist.fasta file are available on GitHub.

Installation

rap_probes() is distributed as part of the probeutils package (at the moment it is the only function, but this will be expanded in time). Installation is simple using pip:

pip install probeutils

Input file

The FASTA interpreter in rap_probes() accepts either a file containing only DNA bases and newlines, or a file with one or more FASTA sequences separated by headers beginning with ‘>’. If the file contains multiple sequences, rap_probes() will only generate probes for the first entry.
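The input rules above can be sketched as a small reader that accepts both headerless and headered input and stops at the second header. The function name read_first_sequence is hypothetical and this is only an illustration of the described behavior, not the interpreter used inside rap_probes(); uppercasing the sequence is an added assumption.

```python
def read_first_sequence(lines):
    """Return (header, sequence) for the first record in FASTA-like lines.

    header is None when the input contains only bases and newlines.
    """
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        if line.startswith(">"):
            if header is not None or parts:
                break  # a second record begins: stop reading
            header = line[1:]
            continue
        parts.append(line.upper())  # assumption: normalize to uppercase
    return header, "".join(parts)


# Two records: only the first should be returned.
example = [">rec1 first entry\n", "acgt\n", "TTGA\n", "\n", ">rec2\n", "GGGG\n"]
print(read_first_sequence(example))

# Headerless input is also accepted.
print(read_first_sequence(["ACGT\n", "AC\n"]))
```

To read from disk, the same function can be given an open file handle, since iterating over a file yields its lines.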

For this example, we will use human Xist downloaded from NCBI:

[1]:

with open('xist.fasta', 'r') as x:
    xist = x.readlines()

The first line of the file contains the gene identity information:

[2]:

xist[0]

[2]:

'>NR_001564.2 Homo sapiens X inactive specific transcript (XIST), long non-coding RNA\n'

The final line is a newline character:

[3]:

xist[-1]

[3]:

'\n'

And all the lines in between are sequences separated by newlines:

[4]:

xist[1:10]

[4]:

['CCTTCAGTTCTTAAAGCGCTGCAATTCGCTGCTGCAGCCATATTTCTTACTCTCTCGGGGCTGGAAGCTT\n',
 'CCTGACTGAAGATCTCTCTGCACTTGGGGTTCTTTCTAGAACATTTTCTAGTCCCCCAACACCCTTTATG\n',
 'GCGTATTTCTTTAAAAAAATCACCTAAATTCCATAAAATATTTTTTTAAATTCTATACTTTCTCCTAGTG\n',
 'TCTTCTTGACACGTCCTCCATATTTTTTTAAAGAAAGTATTTGGAATATTTTGAGGCAATTTTTAATATT\n',
 'TAAGGAATTTTTCTTTGGAATCATTTTTGGTTGACATCTCTGTTTTTTGTGGATCAGTTTTTTACTCTTC\n',
 'CACTCTCTTTTCTATATTTTGCCCATCGGGGCTGCGGATACCTGGTTTTATTATTTTTTCTTTGCCCAAC\n',
 'GGGGCCGTGGATACCTGCCTTTTAATTCTTTTTTATTCGCCCATCGGGGCCGCGGATACCTGCTTTTTAT\n',
 'TTTTTTTTCCTTAGCCCATCGGGGTATCGGATACCTGCTGATTCCCTTCCCCTCTGAACCCCCAACACTC\n',
 'TGGCCCATCGGGGTGACGGATATCTGCTTTTTAAAAATTTTCTTTTTTTGGCCCATCGGGGCTTCGGATA\n']

Running rap_probes()

After installing the probeutils package, rap_probes() should be available for import.

[5]:

# Required only for demonstration
import os
import pandas as pd

# Probe generation
from probe_utils import rap_probes

rap_probes() currently accepts eight parameters. The following is copied from the function description:

fasta : str
    Path to a fasta file containing the sequence to
    generate probes against

gene : str
    The name of the target gene, used to name probes
    and the output file

adaptseq : str
    Any nucleotides that should be added to the 5'-end
    of each probe. These are used for ligating probes
    to beads or barcodes. By default, the value is set
    to the first SPRITE barcode. If no adapter is required,
    set this parameter to ''. Default 'CAAGTCA'

probe_length : int
    The total length of the probe in nucleotides. If
    an adapter is used, this length includes the length
    of the adapter. Default 90

biotin : Bool
    Whether to add a 5'-biotin to the probes. Formatted
    for ordering from Integrated DNA Technologies (IDT).
    Default False

blat : Bool
    Whether to filter probes for multiple genome matches
    using UCSC BLAT. If True, the genome assembly name
    must be supplied to **kwargs. Default True

dfam : Bool
    Whether to filter probes for transposable elements and
    tandem repeats using the Institute for Systems Biology's
    Dfam database. If True, the species name must be supplied
    to **kwargs. Default True

**kwargs : dictionary

    genome : str
        Used for BLAT filtering. Short assembly name for the
        species genome as listed in BLAT, e.g. 'hg38', 'mm39',
        or 'dm6'

    tolerance : int
        Used for BLAT filtering. Number of acceptable matches
        to other genomic loci. Default 25

    species : str
        DFAM species to check repeats, e.g. "Homo sapiens",
        "Mus musculus", or "Drosophila melanogaster"

The function exports up to five files into a directory called [gene]_rapProbesOutput/ (the BLAT and Dfam files are written only when those filters are run). It also returns a Pandas DataFrame with the final probe sequences and names. The full list of output files is in the function description:

output : a Pandas DataFrame
    A dataframe containing the final probes after filtering
    steps. Identical to the Probes.csv file

rapProbesLog.out : a text file
    A text file containing a log of steps taken by the
    rap_probes function

[gene]_[probe_length]ntProbes.csv : a csv file
    A csv file containing the final probes. Identical
    to the Pandas DataFrame output

blatFailedProbes.csv : a csv file
    If performing BLAT filtering, a csv file containing BLAT
    results for probes that did not pass filters

blatPassedProbes.csv : a csv file
    If performing BLAT filtering, a csv file containing BLAT
    results for probes that passed filters

dfamFailedProbes.csv : a csv file
    If performing Dfam filtering, a csv file containing Dfam
    results for probes that did not pass filters

In this example, we will generate probes for human Xist. Given that Xist is a unique genomic locus, we want to use both BLAT and DFAM to filter out non-specific probes. If targeting a repetitive element (e.g. LINE1) or a multicopy gene (e.g. U1 snRNA), these filters should be turned off.

The genome that BLAT queries is specified by the keyword argument ‘genome’, and the species that DFAM searches is specified by the keyword argument ‘species’. Refer to the current builds of the two databases to determine which genomes and species are supported.

[6]:

kwargs = {'genome':'hg38',
         'species':'Homo sapiens'}
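Because blat and dfam both default to True, it is easy to forget one of the keys they depend on. A small pre-flight check such as the one below can catch this before any queries are submitted. The helper missing_filter_kwargs is hypothetical and not part of probeutils; how rap_probes() itself responds to a missing key is not covered here.

```python
def missing_filter_kwargs(blat=True, dfam=True, **kwargs):
    """List the keyword arguments the enabled filters still need."""
    missing = []
    if blat and not kwargs.get("genome"):
        missing.append("genome")   # BLAT needs an assembly name, e.g. 'hg38'
    if dfam and not kwargs.get("species"):
        missing.append("species")  # Dfam needs a species, e.g. 'Homo sapiens'
    return missing


human = {"genome": "hg38", "species": "Homo sapiens"}
print(missing_filter_kwargs(**human))

# Forgetting the species while Dfam filtering is enabled:
print(missing_filter_kwargs(genome="mm39"))

# Both filters off, e.g. for a repetitive or multicopy target:
print(missing_filter_kwargs(blat=False, dfam=False))
```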

With all of this understood, we are now ready to run the script. Progress messages will appear if running BLAT or DFAM, and a final message (‘Probe generation complete’) will inform you that the script ran successfully.

[7]:

df = rap_probes(fasta = 'xist.fasta',
               gene = 'HuXist',
               adaptseq = 'CAAGTCA',
               probe_length = 90,
               biotin = False,
               blat = True,
               dfam = True,
               **kwargs)

Starting BLAT

100%|███████████████████████████████████████████| 10/10 [00:45<00:00,  4.52s/it]

BLAT Done
Starting Dfam
Search submitted successfully.
DFAM Done
Probe generation complete

We can now see that all the output files have been written to HuXist_rapProbesOutput/:

[8]:

os.listdir('HuXist_rapProbesOutput/')

[8]:

['HuXist_90ntProbes.csv',
 'rapProbesLog.out',
 'blatPassedProbes.csv',
 'dfamFailedProbes.csv',
 '.ipynb_checkpoints',
 'blatFailedProbes.csv']

We can check how many probes were originally generated and how many remained after BLAT and DFAM filtering.

[9]:

with open('HuXist_rapProbesOutput/rapProbesLog.out','r') as f:
    print(f.read())

Probe Design Log for HuXist
Original probes generated: 233

BLAT Results
Identified locus: chrX:73820651-73852753 (-)
Genome Match: 100.0%
Probes remaining after BLAT: 137

Dfam Results
Search submitted successfully
Dfam search time: 19 seconds
Probes remaining after Dfam: 136

From this we can see that BLAT correctly identified the Xist locus and that the input file had a 100% match to the hg38 genome. If this match had been lower than 100%, the user would have been asked whether to abort the program or continue.

Around 60% of probes remain, which should be acceptable for most RAP applications.
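The retained fraction can be computed directly from the log rather than by eye. The sketch below parses the three counts with regular expressions; the function name probe_counts is hypothetical, and the patterns assume the log wording shown above stays the same between versions.

```python
import re

# Log text copied from the output above.
log_text = """Probe Design Log for HuXist
Original probes generated: 233

BLAT Results
Identified locus: chrX:73820651-73852753 (-)
Genome Match: 100.0%
Probes remaining after BLAT: 137

Dfam Results
Search submitted successfully
Dfam search time: 19 seconds
Probes remaining after Dfam: 136
"""

PATTERNS = {
    "original": r"Original probes generated:\s*(\d+)",
    "after_blat": r"Probes remaining after BLAT:\s*(\d+)",
    "after_dfam": r"Probes remaining after Dfam:\s*(\d+)",
}


def probe_counts(text):
    """Extract the probe counts reported in a rapProbesLog.out file."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counts[key] = int(match.group(1))
    return counts


counts = probe_counts(log_text)
retained = counts["after_dfam"] / counts["original"]
print(f"Removed by BLAT: {counts['original'] - counts['after_blat']}")
print(f"Removed by Dfam: {counts['after_blat'] - counts['after_dfam']}")
print(f"Retained: {retained:.1%}")  # prints "Retained: 58.4%"
```

Here BLAT removed 96 probes and Dfam removed 1, leaving 136 of 233.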

If we are curious about why probes failed, we can look at the probes that failed BLAT and DFAM filtering. In this case, the probe originally named 218 was particularly problematic:

[10]:

blat = pd.read_csv('HuXist_rapProbesOutput/blatFailedProbes.csv')

blat[blat['qName'] == 218]

[10]:

	matches	qName	tName	tStart	tEnd	qStarts	tStarts
983	83	218	chrX	73821753	73821836	0	73821753
984	40	218	chr2	108652932	108653003	35,42,57,77	108652932,108652940,108652952,108652997
985	29	218	chr7	127736479	127736518	33,50	127736479,127736505
986	29	218	chr16	14040243	14040279	26,32,49	14040243,14040254,14040273
987	28	218	chr6	152466529	152466566	22,27,35	152466529,152466533,152466549
988	27	218	chr12	80656242	80656277	2,12	80656242,80656259
989	26	218	chr2	110474587	110474900	0,6	110474587,110474880
990	26	218	chr4	41387480	41387513	32,54	41387480,41387505
991	24	218	chr4	22061022	22061057	25,32	22061022,22061040
992	24	218	chr2	181947660	181947690	59,75	181947660,181947682
993	24	218	chr11	87347049	87347074	43	87347049
994	23	218	chrX	11420280	11420310	21,35	11420280,11420301
995	22	218	chr6	18851074	18851096	28	18851074
996	22	218	chrUn_GL000195v1	96928	96950	33	96928
997	22	218	chr21	6277553	6277575	33	6277553
998	21	218	chr7	123730795	123730816	33	123730795
999	20	218	chr3	157490812	157490832	43	157490812
1000	20	218	chr13	64330699	64330719	32	64330699
1001	20	218	chr1	107901625	107901645	21	107901625
1002	25	218	chr14	27657732	27657762	20	27657732

DFAM identified only a single probe containing a repetitive element, the hAT transposon MER58C:

[11]:

pd.read_csv('HuXist_rapProbesOutput/dfamFailedProbes.csv')

[11]:

	probe	query	type	e_value
0	GGTAAGCTATGAACAGCAGGCCAAATCCAATTGGCTCAAAAACTAA...	MER58C	DNA	0.000034

The final probes are saved in the file ‘HuXist_90ntProbes.csv’, but the DataFrame returned by rap_probes() lets you explore them directly without reading the file:

[12]:

df.head()

[12]:

	Name	Sequence
0	HuXist_0	CAAGTCAATCTTCAGTCAGGAAGCTTCCAGCCCCGAGAGAGTAAGA...
1	HuXist_1	CAAGTCATAGGTGATTTTTTTAAAGAAATACGCCATAAAGGGTGTT...
2	HuXist_2	CAAGTCAAATCTGAACACGCCCTTAGCTTAACTGCAGAGTCATTCT...
3	HuXist_3	CAAGTCAAAAGGGAGTCCATGAGAAGGTGCCCTTATCTAGTACACA...
4	HuXist_4	CAAGTCATACTGCAAATGGAGGGTGAGAAGGTAGAACTTTGTTTAA...

From here, it is easy to convert the .csv file to a format that can be ordered on IDT plates or an array.
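As one possible approach, the sketch below assigns probes to 96-well plate positions and writes them as CSV. The function plate_layout is hypothetical, the column names ("Plate", "Well Position", "Name", "Sequence") are an assumption that should be checked against the vendor's current upload template, and the sequences are placeholders rather than real probes. Wells are filled row by row (A1 to A12, then B1, and so on).

```python
import csv
import io

ROWS = "ABCDEFGH"    # 8 rows on a 96-well plate
COLUMNS = range(1, 13)  # 12 columns


def plate_layout(probes):
    """Assign (name, sequence) pairs to wells, filling each plate by row."""
    wells = [f"{row}{col}" for row in ROWS for col in COLUMNS]
    layout = []
    for index, (name, sequence) in enumerate(probes):
        plate, position = divmod(index, len(wells))
        layout.append({
            "Plate": plate + 1,
            "Well Position": wells[position],
            "Name": name,
            "Sequence": sequence,
        })
    return layout


# Placeholder probes; in practice use zip(df['Name'], df['Sequence']).
placeholder = [(f"HuXist_{i}", "CAAGTCA" + "N" * 83) for i in range(100)]
layout = plate_layout(placeholder)

# Write the layout as CSV text (swap io.StringIO for an open file to save it).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(layout[0]))
writer.writeheader()
writer.writerows(layout)
print(buffer.getvalue().splitlines()[0])
```

With 136 final probes, this layout would fill one plate completely and 40 wells of a second.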

Computing Environment

[13]:

%load_ext watermark
%watermark -v -p probe_utils,jupyterlab,os,pandas

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.12.0

probe_utils: 1.0.3
jupyterlab : 3.6.3
os         : unknown
pandas     : 2.0.3