{
"cells": [
{
"cell_type": "markdown",
"id": "490cbcc0-4100-4370-929f-3d5715c0fb82",
"metadata": {},
"source": [
"# RAP Probe Generation"
]
},
{
"cell_type": "markdown",
"id": "3af53bf5-f5ae-47f9-9bfc-bf454a0c82d2",
"metadata": {},
"source": [
"This page demonstrates how to generate probe sequences for RNA Antisense Purification (RAP) using the function `rap_probes()`. The function divides a fasta file into fragments of a user-specified length, converts them to their reverse complement, and adds any additional sequences (either an adapter sequence or 5'-biotin). Additionally, the function has the option of filtering probes for multimappers using [BLAT](https://genome.ucsc.edu/cgi-bin/hgBlat), and for repetitive elements using [DFAM](https://www.dfam.org/home).\n",
"\n",
"BLAT filtering in `rap_probes()` ignores multimappers found within the targeted gene. lncRNAs such as Xist and Kcnq1ot1 contain internally repetitive sequences that are found nowhere else in the genome. BLAT will nonetheless identify probes to these regions as multimappers, but all the homologous regions will lie within the intended target. When this situation occurs, `rap_probes()` ignores the multimapping and keeps the probe.\n",
"\n",
"That said, lncRNA research is plagued by studies that fail to note known repetitive elements within annotated lncRNAs. For example, say a scientist performs RAP-DNA on a lncRNA containing a SINE element and does not remove SINE sequences from their probes. Genomic alignment will incorrectly indicate that the lncRNA localizes to SINEs genome-wide. In fact, this result is due to antisense purification of nascent SINE RNAs, which are present in the majority of introns. DFAM filtering reduces the likelihood of this error by removing probes to known selfish elements in a genome.\n",
"\n",
"In general, RAP probes should be 60-120nt long. For protocols using heat elution, shorter probes (60-90nt) are preferable as longer probes reduce the efficiency of melting the RNA-DNA hybrid. If eluting with benzonase or another nuclease, longer probes are acceptable.\n",
"\n",
"I typically recommend using at least 50% coverage of the RNA to ensure efficient capture. For protocols where only a specific part of the RNA is of interest (e.g. RAP-MS for the A-repeat of Xist), it may be possible to tile that specific locus. Be advised, however, that we have not systematically tested this and that it is likely sensitive to the extent of RNA fragmentation. Insufficiently fragmented RNA may be difficult to target with probes to a specific subregion."
]
},
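{
"cell_type": "markdown",
"id": "sketch-tile-revcomp",
"metadata": {},
"source": [
"The fragment, reverse-complement, and adapter steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `probeutils` implementation: the helper names `reverse_complement` and `tile_probes` are hypothetical, the tiling is assumed to be end to end with no overlap, and a shorter final fragment is assumed to be kept.\n",
"\n",
"```python\n",
"# Hypothetical helpers for illustration; not part of probeutils.\n",
"COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')\n",
"\n",
"\n",
"def reverse_complement(seq):\n",
"    # Complement each base, then reverse the string\n",
"    return seq.translate(COMPLEMENT)[::-1]\n",
"\n",
"\n",
"def tile_probes(target, probe_length=90, adaptseq='CAAGTCA'):\n",
"    # The adapter counts toward the total probe length\n",
"    step = probe_length - len(adaptseq)\n",
"    if step <= 0:\n",
"        raise ValueError('probe_length must exceed the adapter length')\n",
"    # Assumption: end-to-end tiling, keeping a shorter final fragment\n",
"    fragments = [target[i:i + step] for i in range(0, len(target), step)]\n",
"    # Probes are antisense to the target, with the adapter on the 5' end\n",
"    return [adaptseq + reverse_complement(frag) for frag in fragments]\n",
"\n",
"\n",
"probes = tile_probes('ACGTTGCAAC' * 20, probe_length=90, adaptseq='CAAGTCA')\n",
"print(len(probes), [len(p) for p in probes])\n",
"```\n",
"\n",
"Because the adapter counts toward the total length, a 90 nt probe with a 7 nt adapter covers 83 nt of the target."
]
},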
{
"cell_type": "markdown",
"id": "7fbc1259-b2d6-43fe-9926-d594a656e1f4",
"metadata": {},
"source": [
"## Following This Demo"
]
},
{
"cell_type": "markdown",
"id": "43570eed-c3f1-4625-a108-0f8ee7387d98",
"metadata": {},
"source": [
"This notebook and the `xist.fasta` file are available [on GitHub](https://github.com/honsonbiosci/rapprobesdemo.git)."
]
},
{
"cell_type": "markdown",
"id": "6f693d2e-2870-4e53-82f1-0d6abaa356f7",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "9f1dfd2b-cb68-446e-bfb5-4c320ed5b295",
"metadata": {},
"source": [
"`rap_probes()` is distributed as part of the `probeutils` package (at the moment it is the only function, but this will be expanded in time). Installation is simple using pip:\n",
"\n",
"`pip install probeutils`"
]
},
{
"cell_type": "markdown",
"id": "3232cd66-8fa7-4971-8c26-dc9fb1f40b3f",
"metadata": {},
"source": [
"## Input file"
]
},
{
"cell_type": "markdown",
"id": "bf6ae940-0bbf-496f-b411-314c8492fd4b",
"metadata": {},
"source": [
"The FASTA interpreter in `rap_probes()` accepts either a file containing only DNA bases and newlines, or a file with one or more FASTA sequences separated by headers beginning with '>'. If the file contains multiple sequences, `rap_probes()` will only generate probes for the first entry.\n",
"\n",
"For this example, we will use human Xist downloaded from NCBI:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ce736e69-9199-44aa-ab99-133f096cd7bd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"with open('xist.fasta', 'r') as x:\n",
"    xist = x.readlines()"
]
},
{
"cell_type": "markdown",
"id": "6fc3cfa8-ec18-4a2c-b915-6a12ab6fc4af",
"metadata": {},
"source": [
"The first line of the file contains the gene identity information:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "eef2c2c0-4b67-4402-9368-53096aba6571",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'>NR_001564.2 Homo sapiens X inactive specific transcript (XIST), long non-coding RNA\\n'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xist[0]"
]
},
{
"cell_type": "markdown",
"id": "2fec4aae-d068-4628-876e-d05de5779ab8",
"metadata": {},
"source": [
"The final line is a newline character:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ec5816bf-991b-4b7f-ae26-054725cd73cc",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'\\n'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xist[-1]"
]
},
{
"cell_type": "markdown",
"id": "503be472-eeb3-4e5b-8b4a-6a5028267a84",
"metadata": {},
"source": [
"And all the lines in between are sequences separated by newlines:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7cdca765-9db2-459d-be70-dff3d1427e3d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['CCTTCAGTTCTTAAAGCGCTGCAATTCGCTGCTGCAGCCATATTTCTTACTCTCTCGGGGCTGGAAGCTT\\n',\n",
" 'CCTGACTGAAGATCTCTCTGCACTTGGGGTTCTTTCTAGAACATTTTCTAGTCCCCCAACACCCTTTATG\\n',\n",
" 'GCGTATTTCTTTAAAAAAATCACCTAAATTCCATAAAATATTTTTTTAAATTCTATACTTTCTCCTAGTG\\n',\n",
" 'TCTTCTTGACACGTCCTCCATATTTTTTTAAAGAAAGTATTTGGAATATTTTGAGGCAATTTTTAATATT\\n',\n",
" 'TAAGGAATTTTTCTTTGGAATCATTTTTGGTTGACATCTCTGTTTTTTGTGGATCAGTTTTTTACTCTTC\\n',\n",
" 'CACTCTCTTTTCTATATTTTGCCCATCGGGGCTGCGGATACCTGGTTTTATTATTTTTTCTTTGCCCAAC\\n',\n",
" 'GGGGCCGTGGATACCTGCCTTTTAATTCTTTTTTATTCGCCCATCGGGGCCGCGGATACCTGCTTTTTAT\\n',\n",
" 'TTTTTTTTCCTTAGCCCATCGGGGTATCGGATACCTGCTGATTCCCTTCCCCTCTGAACCCCCAACACTC\\n',\n",
" 'TGGCCCATCGGGGTGACGGATATCTGCTTTTTAAAAATTTTCTTTTTTTGGCCCATCGGGGCTTCGGATA\\n']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xist[1:10]"
]
},
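{
"cell_type": "markdown",
"id": "sketch-first-fasta-record",
"metadata": {},
"source": [
"The input rules described at the start of this section (an optional header line beginning with '>', and only the first entry used) can be sketched as follows. The helper `first_fasta_sequence` is hypothetical and is not the parser used inside `rap_probes()`; it only illustrates the stated behavior.\n",
"\n",
"```python\n",
"# Hypothetical helper for illustration; not the probeutils parser.\n",
"def first_fasta_sequence(lines):\n",
"    seq_parts = []\n",
"    seen_header = False\n",
"    for line in lines:\n",
"        line = line.strip()\n",
"        if not line:\n",
"            continue  # skip blank lines, such as a trailing newline\n",
"        if line.startswith('>'):\n",
"            if seen_header or seq_parts:\n",
"                break  # a second record begins, so stop here\n",
"            seen_header = True\n",
"            continue\n",
"        seq_parts.append(line)\n",
"    # Join the wrapped sequence lines into one string\n",
"    return ''.join(seq_parts)\n",
"\n",
"\n",
"example = ['>first record\\n', 'ACGT\\n', 'TTGA\\n', '\\n', '>second record\\n', 'GGGG\\n']\n",
"print(first_fasta_sequence(example))\n",
"```\n",
"\n",
"With the file loaded above, `first_fasta_sequence(xist)` would return the Xist sequence as a single string."
]
},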
{
"cell_type": "markdown",
"id": "b039c658-0244-423a-936d-80b24c5137d3",
"metadata": {},
"source": [
"## Running rap_probes()"
]
},
{
"cell_type": "markdown",
"id": "aeb1b002-c27c-4a90-b352-94df91655f94",
"metadata": {},
"source": [
"After installing the `probeutils` package, `rap_probes()` should be available for import."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9477b4bb-15c9-4b8a-bff1-01318f314937",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Required only for demonstration\n",
"import os\n",
"import pandas as pd\n",
"\n",
"# Probe generation\n",
"from probe_utils import rap_probes"
]
},
{
"cell_type": "markdown",
"id": "b9115b29-67dd-4784-94e5-becd7bfc27d9",
"metadata": {},
"source": [
"`rap_probes()` currently accepts eight parameters. The following is copied from the function description:\n",
"\n",
"    fasta : str\n",
"        Path to a fasta file containing the sequence to \n",
"        generate probes against\n",
"    \n",
"    gene : str\n",
"        The name of the target gene, used to name probes\n",
"        and the output file\n",
"    \n",
"    adaptseq : str\n",
"        Any nucleotides that should be added to the 5'-end\n",
"        of each probe. These are used for ligating probes \n",
"        to beads or barcodes. By default, the value is set \n",
"        to the first SPRITE barcode. If no adapter is required, \n",
"        set this parameter to ''. Default 'CAAGTCA'\n",
"    \n",
"    probe_length : int\n",
"        The total length of the probe in nucleotides. If\n",
"        an adaptor is used, this length includes the length\n",
"        of the adapter. Default 90\n",
"    \n",
"    biotin : Bool\n",
"        Whether to add a 5'-biotin to the probes. Formatted\n",
"        for ordering from Integrated DNA Technologies (IDT).\n",
"        Default False\n",
"    \n",
"    blat : Bool\n",
"        Whether to filter probes for multiple genome matches \n",
"        using UCSC BLAT. If True, the genome assembly name \n",
"        must be supplied to **kwargs. Default True\n",
"    \n",
"    dfam : Bool\n",
"        Whether to filter probes for transposable elements and\n",
"        tandem repeats using the Institute of Systems Biology's \n",
"        Dfam database. If True, the species name must be supplied\n",
"        to **kwargs. Default True\n",
"    \n",
"    **kwargs : dictionary\n",
"    \n",
"        genome : str\n",
"            Used for BLAT filtering. Short assembly name for the \n",
"            species genome as listed in BLAT, e.g. 'hg38,' 'mm39,' \n",
"            or 'dm6'\n",
"\n",
"        tolerance : int\n",
"            Used for BLAT filtering. Number of acceptable matches \n",
"            to other genomic loci. Default 25\n",
"    \n",
"        species : str\n",
"            DFAM species to check repeats, e.g. \"Homo sapiens\",\n",
"            \"Mus musculus\", or \"Drosophila melanogaster\"\n",
"    "
]
},
{
"cell_type": "markdown",
"id": "3778041b-9eed-4f73-adae-eb6573131a2d",
"metadata": {
"tags": []
},
"source": [
"The function writes five files into a directory called `[gene]_rapProbesOutput/`. It also returns a Pandas DataFrame with the final probe sequences and names. The full list of outputs is in the function description:\n",
"\n",
"    output : a Pandas DataFrame\n",
"        A dataframe containing the final probes after filtering\n",
"        steps. Identical to the Probes.csv file\n",
"    \n",
"    rapProbesLog.out : a text file\n",
"        A text file containing a log of steps taken by the \n",
"        rap_probes function\n",
"    \n",
"    [gene]_[probe_length]ntProbes.csv : a csv file\n",
"        A csv file containing the final probes. Identical\n",
"        to the Pandas DataFrame output\n",
"    \n",
"    blatFailedProbes.csv : a csv file\n",
"        If performing BLAT filtering, a csv file containing BLAT\n",
"        results for probes that did not pass filters\n",
"    \n",
"    blatPassedProbes.csv : a csv file\n",
"        If performing BLAT filtering, a csv file containing BLAT\n",
"        results for probes that passed filters\n",
"    \n",
"    dfamFailedProbes.csv : a csv file\n",
"        If performing Dfam filtering, a csv file containing Dfam\n",
"        results for probes that did not pass filters\n",
"    "
]
},
{
"cell_type": "markdown",
"id": "67b36b23-ef52-46e7-a620-a2ffb33d64fd",
"metadata": {},
"source": [
"In this example, we will generate probes for human Xist. Given that Xist is a unique genomic locus, we want to use both BLAT and DFAM to filter out non-specific probes. If targeting a repetitive element (e.g. LINE1) or a multicopy gene (e.g. U1 snRNA), these filters should be turned off.\n",
"\n",
"The genome that BLAT queries is specified by the keyword argument `genome`, and the species that DFAM searches is specified by the keyword argument `species`. Refer to the current builds of the two databases to determine which genomes and species are supported."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "133ebf89-5b55-4708-b328-83ab7a5762a7",
"metadata": {},
"outputs": [],
"source": [
"kwargs = {'genome':'hg38',\n",
" 'species':'Homo sapiens'}"
]
},
{
"cell_type": "markdown",
"id": "dcce2676-7f92-46cb-b25f-21b7c277d005",
"metadata": {},
"source": [
"With these parameters set, we are ready to run the function. Progress messages will appear if running BLAT or DFAM, and a final message ('Probe generation complete') will inform you that the function ran successfully."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ffa855ef-bd74-4bfc-979d-d0823f1543af",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting BLAT\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|███████████████████████████████████████████| 10/10 [00:45<00:00, 4.52s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"BLAT Done\n",
"Starting Dfam\n",
"Search submitted successfully.\n",
"DFAM Done\n",
"Probe generation complete\n"
]
}
],
"source": [
"df = rap_probes(fasta = 'xist.fasta',\n",
" gene = 'HuXist',\n",
" adaptseq = 'CAAGTCA',\n",
" probe_length = 90,\n",
" biotin = False,\n",
" blat = True,\n",
" dfam = True,\n",
" **kwargs)"
]
},
{
"cell_type": "markdown",
"id": "3e799edc-46c2-4051-95fe-47c09c27bb96",
"metadata": {},
"source": [
"We can now see that all the output files have been written to `HuXist_rapProbesOutput/`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d1eebd9d-e549-4fcf-b871-cb321c0e3cc2",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['HuXist_90ntProbes.csv',\n",
" 'rapProbesLog.out',\n",
" 'blatPassedProbes.csv',\n",
" 'dfamFailedProbes.csv',\n",
" '.ipynb_checkpoints',\n",
" 'blatFailedProbes.csv']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"os.listdir('HuXist_rapProbesOutput/')"
]
},
{
"cell_type": "markdown",
"id": "77449ee1-3003-4a86-bef2-1fb33edb91c2",
"metadata": {},
"source": [
"We can check how many probes were originally generated, and how many remained after BLAT and DFAM filtering."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "30795bb2-de27-4fa4-b63d-b95306e9a56e",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probe Design Log for HuXist\n",
"Original probes generated: 233\n",
"\n",
"BLAT Results\n",
"Identified locus: chrX:73820651-73852753 (-)\n",
"Genome Match: 100.0%\n",
"Probes remaining after BLAT: 137\n",
"\n",
"Dfam Results\n",
"Search submitted successfully\n",
"Dfam search time: 19 seconds\n",
"Probes remaining after Dfam: 136\n",
"\n"
]
}
],
"source": [
"with open('HuXist_rapProbesOutput/rapProbesLog.out','r') as f:\n",
"    print(f.read())"
]
},
{
"cell_type": "markdown",
"id": "26d65e02-c037-4c2f-960c-c5a74a1586b9",
"metadata": {},
"source": [
"From this we can see that BLAT correctly identified the Xist locus and that the input file had a 100% match to the hg38 genome. If this match had been lower than 100%, the user would have been asked whether to abort the program or continue.\n",
"\n",
"136 of the 233 original probes (about 58%) remain, which should be acceptable for most RAP applications."
]
},
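{
"cell_type": "markdown",
"id": "sketch-log-retention",
"metadata": {},
"source": [
"The retention figures can be recomputed from the log itself. The sketch below uses the log text shown above, pasted in as a string, and a hypothetical helper `probe_counts`; it assumes the log keeps the 'Original probes generated' and 'Probes remaining after' line formats seen in this run.\n",
"\n",
"```python\n",
"# Log text copied from the rapProbesLog.out contents shown above\n",
"log_text = '''Probe Design Log for HuXist\n",
"Original probes generated: 233\n",
"\n",
"BLAT Results\n",
"Identified locus: chrX:73820651-73852753 (-)\n",
"Genome Match: 100.0%\n",
"Probes remaining after BLAT: 137\n",
"\n",
"Dfam Results\n",
"Search submitted successfully\n",
"Dfam search time: 19 seconds\n",
"Probes remaining after Dfam: 136\n",
"'''\n",
"\n",
"\n",
"# Hypothetical helper for illustration; assumes the line formats shown above.\n",
"def probe_counts(log):\n",
"    counts = {}\n",
"    for line in log.splitlines():\n",
"        if line.startswith('Original probes generated:'):\n",
"            counts['original'] = int(line.split(':')[1])\n",
"        elif line.startswith('Probes remaining after'):\n",
"            label, value = line.split(':')\n",
"            # The last word of the label names the filter (BLAT or Dfam)\n",
"            counts[label.split()[-1]] = int(value)\n",
"    return counts\n",
"\n",
"\n",
"counts = probe_counts(log_text)\n",
"original = counts['original']\n",
"for label, n in counts.items():\n",
"    if label != 'original':\n",
"        # Percentage of the original probes that survive each filter\n",
"        print(label, n, round(100 * n / original, 1))\n",
"```\n",
"\n",
"To work from the file on disk instead, the string could be replaced with the contents read from `HuXist_rapProbesOutput/rapProbesLog.out`."
]
},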
{
"cell_type": "markdown",
"id": "899a7444-06a8-4295-b81d-14948f567f58",
"metadata": {},
"source": [
"If we are curious about why probes failed, we can look at the probes that failed BLAT and DFAM filtering. In this case, the probe originally named 218 was particularly problematic:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "03efa06a-07a3-4d69-98ab-256a7b748475",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>matches</th><th>qName</th><th>tName</th><th>tStart</th><th>tEnd</th><th>qStarts</th><th>tStarts</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>983</th><td>83</td><td>218</td><td>chrX</td><td>73821753</td><td>73821836</td><td>0</td><td>73821753</td></tr>\n",
"    <tr><th>984</th><td>40</td><td>218</td><td>chr2</td><td>108652932</td><td>108653003</td><td>35,42,57,77</td><td>108652932,108652940,108652952,108652997</td></tr>\n",
"    <tr><th>985</th><td>29</td><td>218</td><td>chr7</td><td>127736479</td><td>127736518</td><td>33,50</td><td>127736479,127736505</td></tr>\n",
"    <tr><th>986</th><td>29</td><td>218</td><td>chr16</td><td>14040243</td><td>14040279</td><td>26,32,49</td><td>14040243,14040254,14040273</td></tr>\n",
"    <tr><th>987</th><td>28</td><td>218</td><td>chr6</td><td>152466529</td><td>152466566</td><td>22,27,35</td><td>152466529,152466533,152466549</td></tr>\n",
"    <tr><th>988</th><td>27</td><td>218</td><td>chr12</td><td>80656242</td><td>80656277</td><td>2,12</td><td>80656242,80656259</td></tr>\n",
"    <tr><th>989</th><td>26</td><td>218</td><td>chr2</td><td>110474587</td><td>110474900</td><td>0,6</td><td>110474587,110474880</td></tr>\n",
"    <tr><th>990</th><td>26</td><td>218</td><td>chr4</td><td>41387480</td><td>41387513</td><td>32,54</td><td>41387480,41387505</td></tr>\n",
"    <tr><th>991</th><td>24</td><td>218</td><td>chr4</td><td>22061022</td><td>22061057</td><td>25,32</td><td>22061022,22061040</td></tr>\n",
"    <tr><th>992</th><td>24</td><td>218</td><td>chr2</td><td>181947660</td><td>181947690</td><td>59,75</td><td>181947660,181947682</td></tr>\n",
"    <tr><th>993</th><td>24</td><td>218</td><td>chr11</td><td>87347049</td><td>87347074</td><td>43</td><td>87347049</td></tr>\n",
"    <tr><th>994</th><td>23</td><td>218</td><td>chrX</td><td>11420280</td><td>11420310</td><td>21,35</td><td>11420280,11420301</td></tr>\n",
"    <tr><th>995</th><td>22</td><td>218</td><td>chr6</td><td>18851074</td><td>18851096</td><td>28</td><td>18851074</td></tr>\n",
"    <tr><th>996</th><td>22</td><td>218</td><td>chrUn_GL000195v1</td><td>96928</td><td>96950</td><td>33</td><td>96928</td></tr>\n",
"    <tr><th>997</th><td>22</td><td>218</td><td>chr21</td><td>6277553</td><td>6277575</td><td>33</td><td>6277553</td></tr>\n",
"    <tr><th>998</th><td>21</td><td>218</td><td>chr7</td><td>123730795</td><td>123730816</td><td>33</td><td>123730795</td></tr>\n",
"    <tr><th>999</th><td>20</td><td>218</td><td>chr3</td><td>157490812</td><td>157490832</td><td>43</td><td>157490812</td></tr>\n",
"    <tr><th>1000</th><td>20</td><td>218</td><td>chr13</td><td>64330699</td><td>64330719</td><td>32</td><td>64330699</td></tr>\n",
"    <tr><th>1001</th><td>20</td><td>218</td><td>chr1</td><td>107901625</td><td>107901645</td><td>21</td><td>107901625</td></tr>\n",
"    <tr><th>1002</th><td>25</td><td>218</td><td>chr14</td><td>27657732</td><td>27657762</td><td>20</td><td>27657732</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" matches qName tName tStart tEnd qStarts \\\n",
"983 83 218 chrX 73821753 73821836 0 \n",
"984 40 218 chr2 108652932 108653003 35,42,57,77 \n",
"985 29 218 chr7 127736479 127736518 33,50 \n",
"986 29 218 chr16 14040243 14040279 26,32,49 \n",
"987 28 218 chr6 152466529 152466566 22,27,35 \n",
"988 27 218 chr12 80656242 80656277 2,12 \n",
"989 26 218 chr2 110474587 110474900 0,6 \n",
"990 26 218 chr4 41387480 41387513 32,54 \n",
"991 24 218 chr4 22061022 22061057 25,32 \n",
"992 24 218 chr2 181947660 181947690 59,75 \n",
"993 24 218 chr11 87347049 87347074 43 \n",
"994 23 218 chrX 11420280 11420310 21,35 \n",
"995 22 218 chr6 18851074 18851096 28 \n",
"996 22 218 chrUn_GL000195v1 96928 96950 33 \n",
"997 22 218 chr21 6277553 6277575 33 \n",
"998 21 218 chr7 123730795 123730816 33 \n",
"999 20 218 chr3 157490812 157490832 43 \n",
"1000 20 218 chr13 64330699 64330719 32 \n",
"1001 20 218 chr1 107901625 107901645 21 \n",
"1002 25 218 chr14 27657732 27657762 20 \n",
"\n",
" tStarts \n",
"983 73821753 \n",
"984 108652932,108652940,108652952,108652997 \n",
"985 127736479,127736505 \n",
"986 14040243,14040254,14040273 \n",
"987 152466529,152466533,152466549 \n",
"988 80656242,80656259 \n",
"989 110474587,110474880 \n",
"990 41387480,41387505 \n",
"991 22061022,22061040 \n",
"992 181947660,181947682 \n",
"993 87347049 \n",
"994 11420280,11420301 \n",
"995 18851074 \n",
"996 96928 \n",
"997 6277553 \n",
"998 123730795 \n",
"999 157490812 \n",
"1000 64330699 \n",
"1001 107901625 \n",
"1002 27657732 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"blat = pd.read_csv('HuXist_rapProbesOutput/blatFailedProbes.csv')\n",
"\n",
"blat[blat['qName'] == 218]"
]
},
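{
"cell_type": "markdown",
"id": "sketch-blat-offtarget",
"metadata": {},
"source": [
"The table above also illustrates the idea of ignoring hits that fall inside the targeted locus. The sketch below is a hypothetical filter, not the rule implemented in `rap_probes()`. In particular, it assumes that `tolerance` is a ceiling on the number of matching bases allowed in any single hit outside the target locus; that is only one plausible reading of the parameter description and should be checked against the package source.\n",
"\n",
"```python\n",
"# Hypothetical filter for illustration; the rule used by probeutils may differ.\n",
"def off_target_hits(hits, locus, tolerance=25):\n",
"    chrom, start, end = locus\n",
"    flagged = []\n",
"    for hit in hits:\n",
"        # Hits that lie entirely within the target locus are ignored\n",
"        inside = (hit['tName'] == chrom\n",
"                  and hit['tStart'] >= start\n",
"                  and hit['tEnd'] <= end)\n",
"        # Assumption: tolerance caps the matching bases of an off-target hit\n",
"        if not inside and hit['matches'] > tolerance:\n",
"            flagged.append(hit)\n",
"    return flagged\n",
"\n",
"\n",
"# Locus from the log above; three rows taken from the table above\n",
"locus = ('chrX', 73820651, 73852753)\n",
"hits = [\n",
"    {'matches': 83, 'tName': 'chrX', 'tStart': 73821753, 'tEnd': 73821836},\n",
"    {'matches': 40, 'tName': 'chr2', 'tStart': 108652932, 'tEnd': 108653003},\n",
"    {'matches': 23, 'tName': 'chrX', 'tStart': 11420280, 'tEnd': 11420310},\n",
"]\n",
"print(off_target_hits(hits, locus))\n",
"```\n",
"\n",
"Under this assumed rule, the 83-base hit is ignored because it lies within the Xist locus, and only the 40-base hit on chr2 is flagged."
]
},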
{
"cell_type": "markdown",
"id": "4561b33b-2a74-485e-b3ec-70a658304878",
"metadata": {
"tags": []
},
"source": [
"DFAM only identified a single probe containing a repetitive element, the hAT transposon MER58C: "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "106f3b89-e518-43c8-bba2-b5db1d016d1b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>probe</th><th>query</th><th>type</th><th>e_value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>GGTAAGCTATGAACAGCAGGCCAAATCCAATTGGCTCAAAAACTAA...</td><td>MER58C</td><td>DNA</td><td>0.000034</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" probe query type e_value\n",
"0 GGTAAGCTATGAACAGCAGGCCAAATCCAATTGGCTCAAAAACTAA... MER58C DNA 0.000034"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv('HuXist_rapProbesOutput/dfamFailedProbes.csv')"
]
},
{
"cell_type": "markdown",
"id": "dbd100e4-f229-4e83-ac8d-ef6737d67a36",
"metadata": {},
"source": [
"The final probes are located in the file `HuXist_90ntProbes.csv`, but the DataFrame returned by `rap_probes()` lets you explore the probes directly without importing the file:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "87587bba-493b-4a07-b8cf-0b3d1fcbc2ea",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Sequence</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>HuXist_0</td><td>CAAGTCAATCTTCAGTCAGGAAGCTTCCAGCCCCGAGAGAGTAAGA...</td></tr>\n",
"    <tr><th>1</th><td>HuXist_1</td><td>CAAGTCATAGGTGATTTTTTTAAAGAAATACGCCATAAAGGGTGTT...</td></tr>\n",
"    <tr><th>2</th><td>HuXist_2</td><td>CAAGTCAAATCTGAACACGCCCTTAGCTTAACTGCAGAGTCATTCT...</td></tr>\n",
"    <tr><th>3</th><td>HuXist_3</td><td>CAAGTCAAAAGGGAGTCCATGAGAAGGTGCCCTTATCTAGTACACA...</td></tr>\n",
"    <tr><th>4</th><td>HuXist_4</td><td>CAAGTCATACTGCAAATGGAGGGTGAGAAGGTAGAACTTTGTTTAA...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Sequence\n",
"0 HuXist_0 CAAGTCAATCTTCAGTCAGGAAGCTTCCAGCCCCGAGAGAGTAAGA...\n",
"1 HuXist_1 CAAGTCATAGGTGATTTTTTTAAAGAAATACGCCATAAAGGGTGTT...\n",
"2 HuXist_2 CAAGTCAAATCTGAACACGCCCTTAGCTTAACTGCAGAGTCATTCT...\n",
"3 HuXist_3 CAAGTCAAAAGGGAGTCCATGAGAAGGTGCCCTTATCTAGTACACA...\n",
"4 HuXist_4 CAAGTCATACTGCAAATGGAGGGTGAGAAGGTAGAACTTTGTTTAA..."
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "f0f3b587-ad37-4159-8aa5-527e247a85e7",
"metadata": {},
"source": [
"From here, it is easy to convert the `.csv` file to a format that can be ordered on IDT plates or an array."
]
},
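{
"cell_type": "markdown",
"id": "sketch-plate-layout",
"metadata": {},
"source": [
"As one way to do that conversion, the sketch below assigns probes to 96-well plate positions, filling each plate row by row (A1 to A12, then B1, and so on). The helper `plate_rows` and the column names 'Plate' and 'Well Position' are assumptions for illustration; check them against the vendor's current order template before uploading.\n",
"\n",
"```python\n",
"# Hypothetical helper for illustration; column names are assumptions.\n",
"def plate_rows(probes, n_cols=12):\n",
"    # probes: list of (name, sequence) pairs\n",
"    row_letters = 'ABCDEFGH'\n",
"    per_plate = len(row_letters) * n_cols\n",
"    rows = []\n",
"    for i, (name, seq) in enumerate(probes):\n",
"        # Plate number and position within the plate, both zero-based\n",
"        plate, pos = divmod(i, per_plate)\n",
"        well = row_letters[pos // n_cols] + str(pos % n_cols + 1)\n",
"        rows.append({'Plate': plate + 1, 'Well Position': well,\n",
"                     'Name': name, 'Sequence': seq})\n",
"    return rows\n",
"\n",
"\n",
"# Placeholder probes standing in for the 136 probes generated above\n",
"demo = [('HuXist_' + str(i), 'ACGT') for i in range(136)]\n",
"layout = plate_rows(demo)\n",
"print(layout[0]['Well Position'], layout[95]['Well Position'], layout[96]['Plate'])\n",
"```\n",
"\n",
"With the DataFrame returned above, the input could be built with `list(zip(df['Name'], df['Sequence']))`, and the result passed to `pd.DataFrame(...)` and written out with `to_csv()`."
]
},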
{
"cell_type": "markdown",
"id": "0f0994b5-3d4d-43e1-a426-218c98972b36",
"metadata": {},
"source": [
"## Computing Environment"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8b78760b-41eb-4fbb-a826-6b1c939fd65d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python implementation: CPython\n",
"Python version : 3.9.17\n",
"IPython version : 8.12.0\n",
"\n",
"probe_utils: 1.0.3\n",
"jupyterlab : 3.6.3\n",
"os : unknown\n",
"pandas : 2.0.3\n",
"\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -v -p probe_utils,jupyterlab,os,pandas"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
}
},
"nbformat": 4,
"nbformat_minor": 5
}