---
title: "Biogeography of European whitefish"
author: "Marco Crotti"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  github_document:
    toc: true
    toc_depth: 5
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Pipeline for UK whitefish biogeography project using ddRADseq

### Set up working environment
Let's create the folders that are going to contain the output data from all the pipeline components.

```{bash, eval = FALSE}
mkdir ./Desktop/biogeography
mkdir ./Desktop/biogeography/00.Raw_reads ./Desktop/biogeography/01.Demultiplexed_reads ./Desktop/biogeography/02.Trimmomatic_filtering ./Desktop/biogeography/03.Assembly ./Desktop/biogeography/04.bam_alignments ./Desktop/biogeography/05.Stacks ./Desktop/biogeography/06.Phylogenetics ./Desktop/biogeography/07.Population_genetics
```

### Preparing the data for population genomics analyses
We are going to demultiplex raw reads, filter low reads out, align to reference genome, and build a STACKS catalog.

#### 01. Demultiplex raw reads
Use [process_radtags](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) to demultiplex Illumina raw data.

```{bash, eval = FALSE}
process_radtags -P -c -q -r -p ./00.Raw_reads/ -o ./01.Demultiplexed_reads -b ./biogeography_barcodes.txt --inline_inline -i gzfastq -y gzfastq --renz_1 pstI --renz_2 mspI -t 65

```

#### 02. [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) filtering

Remove the first 5 bp and 3 bp from the forward and reverse reads to remove the enzyme cut site, single end reads.

```{bash, eval = FALSE}
# Forward reads
for infile in ./01.Demultiplexed_reads/*.1.fq.gz
do
base=$(basename $infile .fq.gz)
java -jar /usr/local/bin/trimmomatic-0.38.jar SE -threads 4 $infile ./02.Trimmomatic_filtering/$base.fq.gz HEADCROP:5
done

# Reverse reads
for infile in ./01.Demultiplexed_reads/*.2.fq.gz
do
base=$(basename $infile .fq.gz)
java -jar /usr/local/bin/trimmomatic-0.38.jar SE -threads 4 $infile ./02.Trimmomatic_filtering/$base.fq.gz HEADCROP:3
done
```


Do paired-end filtering.

```{bash, eval = FALSE}
for R1 in *.1.fq.gz
do
R2=${R1//1.fq.gz/2.fq.gz}
R1paired=${R1//.1.fq.gz/.P1.fq.gz}
R1unpaired=${R1//.1.fq.gz/.U1.fq.gz}	
R2paired=${R2//.2.fq.gz/.P2.fq.gz}
R2unpaired=${R2//.2.fq.gz/.U2.fq.gz}
echo "$R1 $R2"
java -jar /usr/local/bin/trimmomatic-0.38.jar PE -threads 4 -phred33 $R1 $R2 ./02.Trimmomatic_filtering/$R1paired $R1unpaired ./02.Trimmomatic_filtering/$R2paired $R2unpaired LEADING:20 TRAILING:20 MINLEN:60
done
```

#### 03. Align to European whitefish genome assembly
We are using [bwa](http://bio-bwa.sourceforge.net/) aligner for short reads.

```{bash, eval = FALSE}
for R1 in ./02.Trimmomatic_filtering/*.P1.fq.gz
do
R2=${R1//.P1.fq.gz/.P2.fq.gz}
base=$(basename $R1 .P1.fq.gz)
echo "$base"
bwa mem -t 4 /03.Assembly/EW_assembly.fa $R1 $R2 | samtools view -bSq 20 | \
samtools sort -o ./04.bam_alignments/$base.bam
done
```


#### 04. Build Stacks catalog

For this project we used [STACKS v.2.4.1](http://catchenlab.life.illinois.edu/stacks/). We are using the `ref_map.pl` script to build a catalog with referenced aligned reads.

```{bash, eval = FALSE}
ref_map.pl -T 4 --samples ./04.bam_alignments/ -o ./05.Stacks --popmap ./popmap_biogeography.txt
```


### Population genomics and phylogenetics analyses 

From this point onward, we are filtering and generating datasets for different analyses.

#### Generate a vcf file using [populations](http://catchenlab.life.illinois.edu/stacks/comp/populations.php)

Here we are telling `populations` to retain loci only if present in at least 80% of individuals per population in at least 9 populations. We are also removing loci with heterozygosity higher than 0.6 and minor allele frequency below 0.05. Only one snp per locus is retained to reduce the effect of linkage.

```{bash, eval = FALSE}
populations -P ./05.Stacks -M ./popmap_biogeography2.txt -t 4 -p 9 -r 0.8 --max_obs_het 0.6 --min_maf 0.05 --write_single_snp --vcf
```

##### Filter the vcf file

Here we are using vcftools to further filter the vcf file: minimum SNP depth of 5, minimum mean SNP depth of 10, maximum mean depth of 30, maf of 0.05, 85% of individuals need to have the site.

```{bash, eval = FALSE}
vcftools --vcf ./05.Stacks/populations.snps.vcf --minDP 5 --maxDP 30 --min-meanDP 10 --max-meanDP 30 --maf 0.05 --max-missing 0.85 --recode --recode-INFO-all --out ./5.Stacks/filtered.maf5
```

We then use the `filter_hwe_by_pop.pl` script from the [dDocent](http://www.ddocent.com/) pipeline. This script removes sites that are not in HWE within populations.

```{bash, eval = FALSE}
./filter_hwe_by_pop.pl -v ./05.Stacks/filtered.maf5.recode.vcf -p ./popmap_biogeography2.txt -o ./05.Stacks/filtered.hwe.maf5
```

We found no loci out of HWE. Next, following an example from `dDocent`, we look for individuals with lots of missing data and exclude from the vcf file.

```{bash, eval = FALSE}
vcftools --vcf ./05.Stacks/filtered.hwe.recode.vcf --missing-indv
mawk '$5 > 0.9' out.imiss | cut -f1 > lowDP.indv
vcftools --vcf ./05.Stacks/filtered.hwe.maf5.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out ./05.Stacks/filtered.final.maf5
```

We now generated the final vcf file, with 91 individuals and 25,751 high quality SNPs.

#### Genetic diversity and allelic richness

##### Genetic diversity

We calculated expected and observed heterozygosity, and nucleotide diversity using `populations`. We extracted the loci name from the final vcf, and 
used them as a whitelist.

```{bash, eval = FALSE}
cd ~/Desktop/biogeography/07.Population_genetics/genetic_diversity

cut -f 3 filtered.final.maf5.recode.vcf | tail -n +16 | cut -f 1 -d":" > whitelist.txt

cd ~/Desktop/biogeography/

populations -P ./05.Stacks -O ./07.Population_genetics/genetic_diversity --whitelist ./07.Population_genetics/genetic_diversity/whitelist.txt -M ./vcf_pop_file.txt -t 4 -p 8 -r 0.667 --max_obs_het 0.6 --min_maf 0.05 --smooth --vcf --bootstrap
```


##### Allelic richness

Rarefied allelic richness was calculated in the R package `hierfstat`. We converted the final vcf to structure format using **PGDSpider**, imported the
data into R using `adegenet`.

```{r, eval = FALSE}
library(adegenet); library(hierfstat)

setwd("~/Desktop/biogeography/07.Population_genetics/genetic_diversity/")

genind1 <- read.structure("populations.snps.str", n.ind = 91, n.loc = 41929, 
                          onerowperind = FALSE, col.lab = 1, 
                          NA.char = "-9", ask = FALSE, 
                          row.marknames = 1, quiet = FALSE,col.pop = 2)  
whitefish.hfstat <- genind2hierfstat(genind1, pop = genind1$pop)

AR <- allelic.richness(whitefish.hfstat, min.n = 3)

alp.rich <- mean(na.omit(AR$Ar[,1]))
bal.rich <- mean(na.omit(AR$Ar[,2]))
bwa.rich <- mean(na.omit(AR$Ar[,3]))
eck.rich <- mean(na.omit(AR$Ar[,4]))
nor.rich <- mean(na.omit(AR$Ar[,5]))
hwa.rich <- mean(na.omit(AR$Ar[,6]))
lte.rich <- mean(na.omit(AR$Ar[,7]))
lom.rich <- mean(na.omit(AR$Ar[,8]))
rta.rich <- mean(na.omit(AR$Ar[,9]))
rus.rich <- mean(na.omit(AR$Ar[,10]))
uwa.rich <- mean(na.omit(AR$Ar[,11]))
```


#### Convert the vcf file to phylip 

Here we are converting the final vcf file to phylip format to be analysed with [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/) on the [CIPRES](https://www.phylo.org/) server. For the format conversion we use the `vcf2phylip.py` script (Ortiz, 2019), found [here](https://github.com/edgardomortiz/vcf2phylip).

```{bash, eval = FALSE}
cd ./Desktop/biogeography/06.Phylogenetics
python vcf2phylip.py --input filtered.final.maf5.recode.vcf 
```


#### Admxiture analysis

Here we are converting the final vcf file to [plink](zzz.bwh.harvard.edu/plink/) *bed* format to be used in [Admixture](http://software.genetics.ucla.edu/admixture/).

```{bash, eval = FALSE}
plink --vcf ./05.Stacks/filtered.final.maf5.recode.vcf --double-id --allow-extra-chr --make-bed --out ./07.Population_genetics/admixture/filtered.final

plink --bfile ./07.Population_genetics/admixture/filtered.final --double-id --allow-extra-chr --recode12 --tab --out ./07.Population_genetics/admixture/filtered.final
```

Now we can run the **admixture** analysis. First, write a loop script to test different values of *K*. You can use any text editor. The script looks like this:

```{bash, eval = FALSE}
#!/bin/bash

for K in {2..11}
do
admixture --cv=20 $@ $K | tee log${K}.out;
mv *.P ./results;
mv *.Q ./results;
mv *.out ./results;
done

```

Now run the analysis.

```{bash, eval = FALSE}
cd ./07.Population_genetics/admixture
sh ./run_admixture_loop.sh filtered.final.ped
grep -h CV ./results/log*.out

# plot results in R.
```


#### DAPC analysis in adegenet

First, we need to convert the final vcf file into plink 012 format.

```{bash, eval = FALSE}
plink --vcf ./05.Stacks/filtered.final.maf5.recode.vcf --make-bed --out ./07.Population_genetics/DAPC/filtered.maf5 --allow-extra-chr 

plink --bfile ./07.Population_genetics/DAPC/iltered.maf5 --recode --tab --out ./07.Population_genetics/DAPC/filtered.maf5 --allow-extra-chr 

plink --file ./07.Population_genetics/DAPC/filtered.maf5 --recodeA --out ./07.Population_genetics/DAPC/filtered.maf5 --allow-extra-chr 

plink --file ./07.Population_genetics/DAPC/filtered.maf5 --recode12 --out ./07.Population_genetics/DAPC/filtered.maf5.12 --double-id --allow-extra-chr
```

Now we can run the [DAPC](http://adegenet.r-forge.r-project.org/files/tutorial-dapc.pdf) analysis in the R package adegenet.

```{r, eval = FALSE}
setwd("~/Desktop/biogeography/07.Population_genetics/DAPC/")
library(adegenet)
genlight1 <- read.PLINK("filtered.maf5.raw", map.file="filtered.maf5.map")

groups <- find.clusters(genlight1, max.n.clust=11, n.pca = 90,
                        choose.n.clust = TRUE, criterion = "min")  # find most likely number of clusters

xval1 <- xvalDapc(tab(genlight1, NA.method="mean"), groups$grp, n.pca.max = 90,
                  result = "groupMean", center = TRUE, scale = FALSE, parallel = "multicore",
                  n.pca = NULL, n.rep = 100, xval.plot = TRUE)  # crossalidation to decide how many pcs to retain

dapc1 <- dapc(genlight1, pop = groups$grp, n.pca=10, n.da = 7)  # run dapc

# plot the results (Figure S2)
posterior1 <- data.frame(dapc1$posterior)
colnames(posterior1) <- c("alp","lte","lom","haw","nor","bwa","bal")
cols <- c("alp"="#874000","lte"="gold1","lom"="#2372CD","haw"="#DC8C91","nor"="#b08ea2","bwa"="#e3655b","bal"="#6b9080")

barplot(t(posterior1), col = cols, space = 0,
        xlab="Lake", ylab="Ancestry", border=NA,cex.names = 0.000001)
# the colours were then adjusted in inkscape manually!

```


#### fineRADstructure analysis

We generated a dataset in the `radpainter` format to be used by [fineRADstructure](https://github.com/millanek/fineRADstructure) using `populations`.

```{bash, eval = FALSE}
populations -P ./05.Stacks -O ./07.Population_genetics/fineRadStructure -M ./popmap_radstructure.txt --whitelist ./05.Stacks/whitelist -t 4 -p 9 -r 0.9 --max_obs_het 0.6 --min_maf 0.05 --radpainter
```

Then we run the analysis as indicated in the manual [here](http://cichlid.gurdon.cam.ac.uk/fineRADstructure.html).

```{bash , eval = FALSE}
cd ~/Desktop/biogeography/07.Population_genetics/fineRadStructure
RADpainter paint populations.haps.radpainter
finestructure -x 300000 -y 300000 -z 3000 populations.haps_chunks.out populations.haps_chunks.mcmc.xml
finestructure -m T -x 300000 populations.haps_chunks.out populations.haps_chunks.mcmc.xml populations.haps_chunks.mcmcTree.xml
```

Results are plotted using the `fineRADstructurePlot.R` script that is installed with the program.


#### Principal component analysis using SNPRelate

As the title indicates, a pca was run using the R Bioconductor 
package `SNPRelate` (a good vignette is found [here](http://corearray.sourceforge.net/tutorials/SNPRelate/)).

```{r, eval = FALSE}

#### Principal component analysis ####

# Packages ----
library(SNPRelate);library(ggplot2);library(tidyverse);library(gridExtra);library(MASS);library(viridis)

# Set working directory ----
setwd("~/Desktop/biogeography")

# Functions ----

# Consistent plot aesthetics for PCA
theme.pca <- function() {
  theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_rect(colour="black",fill="white", size=1),
          axis.text = element_text(size=16, color = "black"),
          axis.ticks = element_line(size = 0.5, colour = "black"),
          axis.ticks.length = unit(3, "mm"),
          axis.title.y = element_text(size = 30),
          axis.title.x = element_text(size = 30),
          axis.text.x = element_text(size=20),
          axis.text.y = element_text(size=20),
          legend.title = element_text(size = 20),
          legend.text = element_text(size = 20))
}

# Import vcf file and popdata file ----
vcf.fn <- "./7.Population_genetics/PCA/filtered.final.maf5.recode.vcf"
snpgdsVCF2GDS(vcf.fn, "./7.Population_genetics/PCA/test.final.maf5.gds", method="biallelic.only")
snpgdsSummary("./7.Population_genetics/PCA/test.final.maf5.gds")
genofile <- snpgdsOpen("./7.Population_genetics/PCA/test.final.maf5.gds")


pop_data <- read.table("vcf_pop_file.txt",header=FALSE)


# Start analysis ----
set.seed(1000)  # for reproducibility

# option to filter SNPs for LD before pca
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2,autosome.only = FALSE)  
snpset.id <- unlist(snpset)
pca <- snpgdsPCA(genofile, num.thread=2, snp.id=snpset.id, autosome.only = FALSE)

# pca with all SNPs
pca <- snpgdsPCA(genofile, num.thread=2, autosome.only = FALSE)  

# variance proportion (%)
pc.percent <- pca$varprop*100
pc.percent <- head(round(pc.percent, 2))

# Manipulate results ----
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
                  EV1 = pca$eigenvect[,1],    # the first eigenvector
                  EV2 = pca$eigenvect[,2],
                  EV3 = pca$eigenvect[,3],
                  EV4 = pca$eigenvect[,4],
                  EV5 = pca$eigenvect[,5],
                  EV6 = pca$eigenvect[,6],
                  EV7 = pca$eigenvect[,7],# the second eigenvector
                  stringsAsFactors = FALSE)
head(tab)
# add population data
tab[,9] <- pop_data$V2
colnames(tab)[9] <- "Population"

# reorder lake names
tab$Population <- factor(tab$Population, levels = c("NOR","BAL","RUS","LOM","ECK","ALP","LTE","HAW","RTA","BWA","UWA"))

# Plot results ----

cols <- c("HAW"="#DC8C91","BWA"="#e3655b","ECK"="#74BEE9","LOM"="#2372CD","RTA"="firebrick1","LTE"="gold1","UWA"="firebrick4","ALP"="#874000","NOR"="#b08ea2","BAL"="#6b9080","RUS"="#464e47")

# PC1 and PC2
ggplot(tab, aes(x=EV1, y=EV2,col=Population)) + geom_point(size=6) + labs(x=paste("EV1",round(pc.percent, 2)[1],"%"), y = paste("EV2", round(pc.percent, 2)[2], "%")) + theme_bw() + scale_color_manual(values=cols) +
  theme(axis.title.y = element_text(size = 30),axis.title.x = element_text(size = 30),axis.text.x = element_text(size=21),axis.text.y = element_text(size=21)) +
  theme(legend.title = element_text(size=20)) + theme(legend.text = element_text(size=18))

# PC1 and PC3
ggplot(tab, aes(x=EV1, y=EV3,col=Population)) + geom_point(size=6) + labs(x=paste("EV1",round(pc.percent, 2)[1],"%"), y = paste("EV3", round(pc.percent, 2)[2], "%")) + theme_bw() + scale_color_manual(values=cols) +
  theme(axis.title.y = element_text(size = 30),axis.title.x = element_text(size = 30),axis.text.x = element_text(size=21),axis.text.y = element_text(size=21)) +
  theme(legend.title = element_text(size=20)) + theme(legend.text = element_text(size=18))


# Calculate the SNP correlations between eigenvactors and SNP genotypes:
chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))
chr2 <- parse_number(chr)
CORR <- snpgdsPCACorr(pca, genofile, eig.which=1:4)

savepar <- par(mfrow=c(3,1), mai=c(0.3, 0.55, 0.1, 0.25))
for (i in 1:3)
{
  plot(abs(CORR$snpcorr[i,]), ylim=c(0,1), xlab="", ylab=paste("PC", i),
       col=chr2, pch="+")
}

# plot Figure S1
savepar <- par(mfrow=c(3,1), mai=c(0.3, 0.55, 0.1, 0.25))
for (i in 1:3)
{
  plot(CORR$snpcorr[i,], ylim=c(-1,1), xlab="", ylab=paste("PC", i),
       col=chr2, pch="+")
}

```

#### Genome-wide FST

We calculated genome-wide FST across populations using `vcftools`, and then visualised the results in R.

```{bash, eval = FALSE}
cd ~/Desktop/biogeography/07.Population_genetics/Fst 

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_ALP_Fst --weir-fst-pop lom.txt --weir-fst-pop alp.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_BAL_Fst --weir-fst-pop lom.txt --weir-fst-pop bal.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_NOR_Fst --weir-fst-pop lom.txt --weir-fst-pop nor.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_LTE_Fst --weir-fst-pop lom.txt --weir-fst-pop lte.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_RTA_Fst --weir-fst-pop lom.txt --weir-fst-pop rta.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_BWA_Fst --weir-fst-pop lom.txt --weir-fst-pop bwa.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_UWA_Fst --weir-fst-pop lom.txt --weir-fst-pop uwa.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out LOM_ECK_Fst --weir-fst-pop lom.txt --weir-fst-pop eck.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_ALP_Fst --weir-fst-pop eck.txt --weir-fst-pop alp.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_BAL_Fst --weir-fst-pop eck.txt --weir-fst-pop bal.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_NOR_Fst --weir-fst-pop eck.txt --weir-fst-pop nor.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_LTE_Fst --weir-fst-pop eck.txt --weir-fst-pop lte.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_RTA_Fst --weir-fst-pop eck.txt --weir-fst-pop rta.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_BWA_Fst --weir-fst-pop eck.txt --weir-fst-pop bwa.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out ECK_UWA_Fst --weir-fst-pop eck.txt --weir-fst-pop uwa.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_ALP_Fst --weir-fst-pop bal.txt --weir-fst-pop alp.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_NOR_Fst --weir-fst-pop bal.txt --weir-fst-pop nor.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_LTE_Fst --weir-fst-pop bal.txt --weir-fst-pop lte.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_RTA_Fst --weir-fst-pop bal.txt --weir-fst-pop rta.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_BWA_Fst --weir-fst-pop bal.txt --weir-fst-pop bwa.txt --fst-window-size 1000000

vcftools --vcf ../filtered.final.maf5.recode.vcf --out BAL_UWA_Fst --weir-fst-pop bal.txt --weir-fst-pop uwa.txt --fst-window-size 1000000

```

```{r, eval = FALSE}
library(tidyverse)
setwd("~/Desktop/biogeography/07.Population_genetics/Fst")
lom_alp <- read.table("LOM_ALP_Fst.windowed.weir.fst", header = TRUE)
lom_bal <- read.table("LOM_BAL_Fst.windowed.weir.fst", header = TRUE)
lom_nor <- read.table("LOM_NOR_Fst.windowed.weir.fst", header = TRUE)
lom_lte <- read.table("LOM_LTE_Fst.windowed.weir.fst", header = TRUE)
lom_rta <- read.table("LOM_RTA_Fst.windowed.weir.fst", header = TRUE)
lom_uwa <- read.table("LOM_UWA_Fst.windowed.weir.fst", header = TRUE)
lom_bwa <- read.table("LOM_BWA_Fst.windowed.weir.fst", header = TRUE)
lom_eck <- read.table("LOM_ECK_Fst.windowed.weir.fst", header = TRUE)
eck_alp <- read.table("ECK_ALP_Fst.windowed.weir.fst", header = TRUE)
eck_bal <- read.table("ECK_BAL_Fst.windowed.weir.fst", header = TRUE)
eck_nor <- read.table("ECK_NOR_Fst.windowed.weir.fst", header = TRUE)
eck_lte <- read.table("ECK_LTE_Fst.windowed.weir.fst", header = TRUE)
eck_rta <- read.table("ECK_RTA_Fst.windowed.weir.fst", header = TRUE)
eck_uwa <- read.table("ECK_UWA_Fst.windowed.weir.fst", header = TRUE)
eck_bwa <- read.table("ECK_BWA_Fst.windowed.weir.fst", header = TRUE)
lom_bal <- read.table("LOM_BAL_Fst.windowed.weir.fst", header = TRUE)
eck_bal <- read.table("ECK_BAL_Fst.windowed.weir.fst", header = TRUE)
alp_bal <- read.table("BAL_ALP_Fst.windowed.weir.fst", header = TRUE)
nor_bal <- read.table("BAL_NOR_Fst.windowed.weir.fst", header = TRUE)
lte_bal <- read.table("BAL_LTE_Fst.windowed.weir.fst", header = TRUE)
rta_bal <- read.table("BAL_RTA_Fst.windowed.weir.fst", header = TRUE)
bwa_bal <- read.table("BAL_BWA_Fst.windowed.weir.fst", header = TRUE)
uwa_bal <- read.table("BAL_UWA_Fst.windowed.weir.fst", header = TRUE)

# Using Scottish populations as reference (Figure S5)
ggplot() + theme.pca() +
  stat_density(data = lom_alp, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#874000") +
  stat_density(data = lom_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#6b9080") +
  stat_density(data = lom_nor, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#b08ea2") +
  stat_density(data = lom_lte, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "gold1") +
  stat_density(data = lom_rta, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = lom_uwa, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = lom_bwa, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = lom_eck, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#2372CD") +
  stat_density(data = eck_alp, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#874000") +
  stat_density(data = eck_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#6b9080") +
  stat_density(data = eck_nor, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#b08ea2") +
  stat_density(data = eck_lte, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "gold1") +
  stat_density(data = eck_rta, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = eck_uwa, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = eck_bwa, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") + 
  ylim(0,3) + labs(x = "Weir & Cockerham Fst", y = "Density")

# Using Baltic population as reference (Figure S5)
ggplot() + theme.pca() +
  stat_density(data = lom_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#2372CD") +
  stat_density(data = eck_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#2372CD") +
  stat_density(data = alp_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#874000") +
  stat_density(data = nor_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "#b08ea2") +
  stat_density(data = lte_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "gold1") +
  stat_density(data = rta_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = uwa_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") +
  stat_density(data = bwa_bal, aes(x = WEIGHTED_FST), size = 2,geom="line", colour = "firebrick1") + 
  ylim(0,3) + labs(x = "Weir & Cockerham Fst", y = "Density")


# calculate and plot z-Fst across 1 Mb windows

# define windows to help with plotting
lom_alp$window <- 1:nrow(lom_alp)
lom_bal$window <- 1:nrow(lom_bal)
lom_nor$window <- 1:nrow(lom_nor)
lom_lte$window <- 1:nrow(lom_lte)
lom_rta$window <- 1:nrow(lom_rta)

# calculate z-fst
lom_alp$z_fst <- (lom_alp$WEIGHTED_FST - mean(lom_alp$WEIGHTED_FST)) /sd(lom_alp$WEIGHTED_FST)
lom_bal$z_fst <- (lom_bal$WEIGHTED_FST - mean(lom_bal$WEIGHTED_FST)) /sd(lom_bal$WEIGHTED_FST)
lom_nor$z_fst <- (lom_nor$WEIGHTED_FST - mean(lom_nor$WEIGHTED_FST)) /sd(lom_nor$WEIGHTED_FST)
lom_lte$z_fst <- (lom_lte$WEIGHTED_FST - mean(lom_lte$WEIGHTED_FST)) /sd(lom_lte$WEIGHTED_FST)
lom_rta$z_fst <- (lom_rta$WEIGHTED_FST - mean(lom_rta$WEIGHTED_FST)) /sd(lom_rta$WEIGHTED_FST)

# plot Figure S6
nCHR <- length(unique(lom_alp$CHROM))

lom_alp_plot <- ggplot(data = filter(lom_alp, N_VARIANTS >= 5), aes(x = window, y = z_fst, colour = as.factor(CHROM))) + 
  geom_point(size = 3) + geom_jitter(width = 0.1) + theme_bw() +
  scale_color_manual(values = rep(c("gray80", "#183059"), nCHR)) + theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45)) + 
  geom_hline(yintercept = 3, linetype = 'dashed') +
  ylim(0,3.5) 

lom_nor_plot <- ggplot(data = filter(lom_nor, N_VARIANTS >= 5), aes(x = window, y = z_fst, colour = as.factor(CHROM))) + 
  geom_point(size = 3) + geom_jitter(width = 0.1) + theme_bw() +
  scale_color_manual(values = rep(c("gray80", "#183059"), nCHR)) + theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45)) + 
  geom_hline(yintercept = 3, linetype = 'dashed') +
  ylim(0,3.5) 

lom_bal_plot <- ggplot(data = filter(lom_bal, N_VARIANTS >= 5), aes(x = window, y = z_fst, colour = as.factor(CHROM))) + 
  geom_point(size = 3) + geom_jitter(width = 0.1) + theme_bw() +
  scale_color_manual(values = rep(c("gray80", "#183059"), nCHR)) + theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45)) + 
  geom_hline(yintercept = 3, linetype = 'dashed') +
  ylim(0,3.5) 

lom_lte_plot <- ggplot(data = filter(lom_lte, N_VARIANTS >= 5), aes(x = window, y = z_fst, colour = as.factor(CHROM))) + 
  geom_point(size = 3) + geom_jitter(width = 0.1) + theme_bw() +
  scale_color_manual(values = rep(c("gray80", "#183059"), nCHR)) + theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45)) + 
  geom_hline(yintercept = 3, linetype = 'dashed') +
  ylim(0,3.5) 

lom_rta_plot <- ggplot(data = filter(lom_rta, N_VARIANTS >= 5), aes(x = window, y = z_fst, colour = as.factor(CHROM))) + 
  geom_point(size = 3) + geom_jitter(width = 0.1) + theme_bw() +
  scale_color_manual(values = rep(c("gray80", "#183059"), nCHR)) + theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45)) + 
  geom_hline(yintercept = 3, linetype = 'dashed') +
  ylim(0,3.5) 

library(patchwork)

lom_nor_plot / 
  lom_alp_plot /
  lom_bal_plot /
  lom_lte_plot /
  lom_rta_plot

```

#### TreeMix
We explored and visualised admixture events using TreeMix using the pop gen dataset, testing 0-11 migration edges.

```{bash, eval = FALSE}
for i in {0..11}
do
 treemix -i $@ -m $i -o ./results/edge.$i -k 100 > treemix_${i}_log &
done
```