This folder contains the required documentation, scripts, and files to replicate the results in the manuscript 'Topic Modelling for Untargeted Substructure Exploration in Metabolomics'.

The 'MS2LDA' folder contains the source code for both the R and Python parts of the workflow.

The 'analysis' folder contains the Jupyter Notebooks used for the analysis and for generating the results in the paper.


================================================================================
Installing the required software packages to run MS2LDA:

1. Setting up R
----------------

MS2LDA relies on R to perform feature extraction from the mzXML/mzML files into LDA count matrices. R can be obtained from https://www.r-project.org. We also recommend installing RStudio (https://www.rstudio.com), a widely used development environment.

The two most important R packages to install are XCMS and RMassBank, available from Bioconductor:

* http://bioconductor.org/packages/release/bioc/html/xcms.html
* http://www.bioconductor.org/packages/release/bioc/html/RMassBank.html

Additionally, the following two R packages need to be installed:

> install.packages('gtools') # for natural sorting
> install.packages('yaml') # to load config file

2. Setting up Python
---------------------

1. MS2LDA inference and visualisation are implemented in Python. If you already have a Python environment installed, ensure that it is based on Python version 2.7 and that the following libraries are present, installing them if necessary with the following commands:

> pip install numpy

> pip install scipy

> pip install matplotlib

The Jupyter Notebook can be installed by following the instructions at http://jupyter.readthedocs.io/en/latest/install.html.

2. For new users, we recommend installing the Anaconda Python distribution: https://store.continuum.io/cshop/anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science. Ensure that you download the installer for Python 2.7.

3. **Launch the installer for Anaconda Python** and proceed with the installation process. Accept all the default options and wait for installation to finish.
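
Whichever route is used, a quick check can confirm that the interpreter and the libraries listed above are in place. The snippet below is a minimal sketch and not part of MS2LDA: the function name check_environment is hypothetical, and it only reports whether each module can be imported and which version it declares.

```python
from __future__ import print_function

import importlib
import sys


def check_environment(modules=("numpy", "scipy", "matplotlib")):
    """Map each module name to its version string, or None if it cannot be imported.

    Hypothetical helper for illustration only; it is not shipped with MS2LDA.
    """
    found = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    # The workflow described here expects Python 2.7.
    print("Python version: %d.%d" % sys.version_info[:2])
    for name, version in sorted(check_environment().items()):
        print("%-12s %s" % (name, version if version else "MISSING"))
```

Any library reported as MISSING can be installed with the corresponding pip command above.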

3. Running example analysis
---------------------------

1. Windows: from the "Anaconda" start menu folder, **launch the "IPython (2.7) Notebook"**. This opens the notebook client in the web browser.

2. Linux/OSX: run 'jupyter notebook' from the terminal.

3. Once the Jupyter Notebook environment has loaded, navigate to the 'analysis' folder and open one of the notebooks inside. Each cell in the notebook can be executed sequentially by first selecting a cell, then pressing Shift-Enter to run that cell and move on to the next one. Alternatively, from the notebook menu, click **Cell** followed by **Run All** to execute all cells. The following notebooks are provided as examples of the kind of analysis that can be done with the MS2LDA workflow:

    - example_notebook_1.ipynb demonstrates how to run the Mass2Motif discovery stage of MS2LDA
    - example_notebook_2.ipynb demonstrates how to load an existing MS2LDA analysis and visualise it
    - standards_mass2motifs.ipynb demonstrates how to match the MS1 peaklist in MS2LDA against an external list of standards and plot their Mass2Motif compositions
    - de_notebook.ipynb demonstrates how to perform differential analysis on the Mass2Motifs discovered by MS2LDA and compares the results against cosine clustering

4. File Descriptions
---------------------

The following files can be found in the root folder upon unzipping:
- FourBeers_PositiveMode_MolecularNetworking.cys is the Molecular Networking output for the beer samples in positive ionization mode.
- FourBeers_NegativeMode_MolecularNetworking.cys is the Molecular Networking output for the beer samples in negative ionization mode.
- GNPS_Mass2Motif_validations.csv is the validation result for the GNPS spectra.
- MassBank_Mass2Motif_validations.csv is the validation result for the MassBank spectra.

The following files/folders can be found in the **analysis** folder:

- Four Jupyter notebooks (with the .ipynb extension), alongside a single Python file (peak_match.py), which illustrate the various analysis tasks listed in section 3.
- **input_files** contains the input count matrices for LDA, alongside other peaklists for standard matching and expression information.
- **projects** contains the saved MS2LDA analysis used to generate the results in the manuscript.
- **yml** contains the .yml files used for the feature extraction stage of the different Beer input files in the manuscript.

The following files/folders can be found in the **FourBeers_Data_mzxml** folder:

- Information on three standard compounds, generated during data acquisition (in Std*.csv)
- mzXML or mzML files for the MS1 and MS2 fragmentation data
    - full_scan_files contains the full scan MS1 data of the four beers in 3 replicates (e.g. Beer_1_full1_pos.mzXML), the pooled Beer QC sample (e.g. Beer_PoolB_full_e_pos.mzXML) and the blanks (e.g. blank3_pos.mzXML)
    - fragmentation_files contains files in mzXML and mzML formats. For example, Beer_1_T10_POS.mzML is a top 10 (T10) data-dependent analysis mass fragmentation experiment in separate fragmentation mode, where the suffix denotes positive (POS) or negative (NEG) ionization. Files of the form Beer_1_T10_posnegPOS.mzML are the top 10 (T10) or top 5 (Top5) data-dependent analysis mass fragmentation experiments in combined fragmentation mode, again in positive (POS) or negative (NEG) ionization.
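
The naming convention of the fragmentation files can be decoded programmatically, for instance when selecting files for a batch run. The following is a rough sketch only: the regular expression is inferred from the two example names above and is not an official naming specification, the function name parse_fragmentation_name is hypothetical, and it is assumed that the Top5 label appears in the same position as T10.

```python
import re

# Pattern inferred from the example file names in this README (an assumption,
# not an official specification): Beer_<n>_<T|Top><N>_[posneg]<POS|NEG>.<ext>
FRAGMENTATION_NAME = re.compile(
    r"^Beer_(?P<beer>\d+)"
    r"_(?:T|Top)(?P<top_n>\d+)"
    r"_(?P<combined>posneg)?(?P<polarity>POS|NEG)"
    r"\.(?P<ext>mzML|mzXML)$"
)


def parse_fragmentation_name(filename):
    """Return a dict describing a fragmentation file name, or None if it does not match."""
    match = FRAGMENTATION_NAME.match(filename)
    if match is None:
        return None
    return {
        "beer": int(match.group("beer")),
        "top_n": int(match.group("top_n")),
        "polarity": match.group("polarity"),
        # The 'posneg' prefix marks the combined fragmentation mode.
        "combined": match.group("combined") is not None,
        "format": match.group("ext"),
    }


if __name__ == "__main__":
    for name in ("Beer_1_T10_POS.mzML", "Beer_1_T10_posnegPOS.mzML"):
        print(name, parse_fragmentation_name(name))
```

Names that do not follow the assumed pattern, such as the full scan files, return None and can be skipped.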