--- title: "Preparing epiRADseq data" author: "Marco Crotti" date: "`r format(Sys.time(), '%d %B, %Y')`" output: github_document: toc: true toc_depth: 5 fontsize: 12pt --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Pipeline for creating read count data for epiRADseq analysis EpiRADseq is technique developed to study DNA methylation and is very similar to ddRADseq. In fact, same as ddRADseq, it works both with and without a reference genome. Here I am showing how to extract read counts of the genome-referenced loci from the [STACKS](http://catchenlab.life.illinois.edu/stacks/) pipeline. ### Building the Stacks catalog The whole process of demultiplexing raw reads, quality control and trimming, and genome alignment is described in another file. Assuming you have good quality bam files after aligning your reads to a reference genome, you can use the `ref_map.pl` script to build a catalog with reference aligned reads. ```{bash, eval = FALSE} ref_map.pl -T 4 --samples ./04.bam_alignments/ -o ./05.Stacks --popmap ./popmap.txt ``` The folder `04.bam_alignments` containes the bam files, while `05.Stacks` containes all the output from Stacks. Now you have a catalog of genome-referenced loci, which can be used to compare methylation status between populations/experimental treatments. To do so, you need to extract the number of reads per locus each individual in your dataset possesses. ### Align fastq reads to Stacks catalog Before counting the number of reads per locus per individual, you first need to align your reads to the Stacks catalog loci. This can be done easily with `bwa`. Here is an example below: ```{bash, eval = FALSE} # create an index of the loci to be used by bwa bwa index ./05.Stacks/catalog.assembly.fasta # use a loop in bash to align all files for R1 in ./02.Trimmomatic_filtering/*ELT*_P1* do R2=${R1//.1_P1.fq.gz/.2_P2.fq.gz} base=$(basename $R1 .1_P1.fq.gz) echo "$base $R1 $R2" bwa mem -t 8 ./05.Stacks/catalog.assembly.fasta $R1 $R2 | \ samtools view -bSq - | \ samtools sort - > ./06.Stacks/epiRAD_analysis/$base.bam done # index the bam files for bam in ./06.Stacks/epiRAD_analysis/*.bam do samtools index $bam done ``` ### Extract read counts Now you can use samtools to extract the number of reads per locus per individual. ```{bash, eval = FALSE} for i in ./06.Stacks/epiRAD_analysis/*.bam do base=$(basename $i .bam) echo "Outputting read counts: "$i samtools idxstats $i | cut -f 1,3 > ./06.Stacks/epiRAD_analysis/$base\_depth.txt done ``` The depth.txt file now containes the number of reads per locus. You can use this `awk` script to combine all depth.txt files from all samples into a single table. ```{bash, eval = FALSE} awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $2 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v ./06.Stacks/epiRAD_analysis/*_depth.txt) > ./06.Stacks/epiRAD_analysis/reads_count_replicates.txt ```