Code used for MS2LDA+ data processing
---

1. R scripts
=============

The `R` directory contains the scripts that extract MS1-MS2 peaklists from .mzXML/.mzML files.
- **MS1MS2_MatrixGeneration.R** processes mzXML/mzML files and produces a list of MS1 & MS2 peaks. This is the main script to run when converting .mzXML and .mzML data into the CSV peaklists that MS2LDA+ expects. Before running it, edit the configuration files in the `config` folder so that they point to the location of your input files.
- **xcmsPeakPicking.R** is an additional script to perform peak picking in XCMS, producing peakML files. Used for differential analysis.
- **mzMatch_process.R** builds the matrix of MS1 peaks across samples. Used for differential analysis.

Requirements:

    install.packages('yaml')
    install.packages('gtools')

    source('https://bioconductor.org/biocLite.R')
    biocLite('xcms')
    biocLite('RMassBank')

2. Python scripts
=================

Once the peaklists have been generated by the R scripts above, we can run LDA. The `Python` directory contains the code needed to do this. Specifically:
- **run_lda.py** is the main script to call. Be sure to adjust the parameters in it so that they point to the location of the extracted peaklists.
- Equivalently, **run_lda.ipynb** performs the same function as the script above but in Jupyter notebook format. It also demonstrates some of the analysis plots that can be computed from MS2LDA+ output.

Requirements:

We recommend Anaconda Python, which comes with all the required packages installed. Otherwise, you can install the following packages yourself in your preferred Python distribution: numpy, scipy, pandas.
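
If you are not using Anaconda, a quick way to confirm that the packages listed above are present is a small check like the one below. This is only an illustrative sketch and not part of the repository: the helper name `missing_packages` is made up for this example, and the check only tests whether each package can be found, not whether its version is suitable.

```python
import importlib.util

# Packages named in the requirements above.
REQUIRED = ("numpy", "scipy", "pandas")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in this environment.

    Hypothetical helper for illustration; it is not part of MS2LDA+.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are available.")
```

Any package reported as missing can then be installed with your distribution's package manager before running `run_lda.py`.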

3. Stool Sample Analysis
========================

For the analysis of the stool samples, refer to the 'multifile_fecal_2e6.ipynb' Jupyter notebook inside the 'stool_analysis' folder. The notebook shows how mzML parsing and feature extraction are carried out entirely in Python before the MS2LDA+ analysis is run.