---
title: "Preparing epiRADseq data"
author: "Marco Crotti"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  github_document:
    toc: true
    toc_depth: 5
fontsize: 12pt
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Pipeline for creating read count data for epiRADseq analysis

EpiRADseq is technique developed to study DNA methylation and is very similar to ddRADseq. In fact, same as ddRADseq, it works both with and without 
a reference genome. Here I am showing how to extract read counts of the genome-referenced loci from the
[STACKS](http://catchenlab.life.illinois.edu/stacks/) pipeline. 


### Building the Stacks catalog

The whole process of demultiplexing raw reads, quality control and trimming, and genome alignment is described in another file. Assuming you have good
quality bam files after aligning your reads to a reference genome, you can use the `ref_map.pl` script to build a catalog with reference aligned reads.

```{bash, eval = FALSE}
ref_map.pl -T 4 --samples ./04.bam_alignments/ -o ./05.Stacks --popmap ./popmap.txt

```

The folder `04.bam_alignments` containes the bam files, while `05.Stacks` containes all the output from Stacks. Now you have a catalog of 
genome-referenced loci, which can be used to compare methylation status between populations/experimental treatments. To do so, you need to extract
the number of reads per locus each individual in your dataset possesses.


### Align fastq reads to Stacks catalog

Before counting the number of reads per locus per individual, you first need to align your reads to the Stacks catalog loci. This can be done easily 
with `bwa`. Here is an example below:

```{bash, eval = FALSE}
# create an index of the loci to be used by bwa
bwa index ./05.Stacks/catalog.assembly.fasta

# use a loop in bash to align all files
for R1 in ./02.Trimmomatic_filtering/*ELT*_P1*
do
R2=${R1//.1_P1.fq.gz/.2_P2.fq.gz}
base=$(basename $R1 .1_P1.fq.gz)
echo "$base $R1 $R2"
bwa mem -t 8 ./05.Stacks/catalog.assembly.fasta $R1 $R2 | \
samtools view -bSq - | \
samtools sort - > ./06.Stacks/epiRAD_analysis/$base.bam
done

# index the bam files
for bam in ./06.Stacks/epiRAD_analysis/*.bam
do
samtools index $bam
done

```


### Extract read counts


Now you can use samtools to extract the number of reads per locus per individual.

```{bash, eval = FALSE}
for i in ./06.Stacks/epiRAD_analysis/*.bam
do
base=$(basename $i .bam)
echo "Outputting read counts: "$i
samtools idxstats $i | cut -f 1,3 > ./06.Stacks/epiRAD_analysis/$base\_depth.txt
done
```

The depth.txt file now containes the number of reads per locus. You can use this `awk` script to combine all depth.txt files from all samples into 
a single table.

```{bash, eval = FALSE}
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $2 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v ./06.Stacks/epiRAD_analysis/*_depth.txt) > ./06.Stacks/epiRAD_analysis/reads_count_replicates.txt

```