# Dataset 
This repository contains the dataset and analysis code used for the paper
"Does Interacting Help Users Better Understand the Structure of Probabilistic Models?" by Evdoxia Taka, Sebastian Stein, and John H. Williamson.

This is a collection of data and code from a user study that evaluated the effect of interactive visualizations on users' understanding of the structure of probabilistic models. The dataset consists of the responses, demographics, and interaction logs of participants who used either a static or an interactive version of pair plot visualizations.

## Collected Data
The data from the user study is held in `study_analysis/data/study_01.db`, an SQL database. The database contains the questions of the study, the responses of the participants, their demographics, and the interaction logs of the interactive group. The script `study_analysis/utils/db/utils.py` provides methods for connecting to the database. The script `study_analysis/utils/db/get_data_db.py` provides methods for retrieving the data from the database.

The following schema describes the tables and columns of the database.

TABLE participants # table for participants
	COLUMN name integer NOT NULL PRIMARY KEY # participant id - random integer between 0 and 200
	COLUMN mode text NOT NULL # visualization condition in {'i','s','u'}, where 'i': interactive, 's': static, and 'u': duplicate registration made by the same participant by mistake
	COLUMN status text NOT NULL # user study status of participants e.g. 'training_videos', 'task', 'demo', 'end', 'end_thanks'
	COLUMN begin_timestamp text # timestamp when participant started the study
	COLUMN end_timestamp text # timestamp when participant finished the study
	COLUMN results_email text # participant's email to receive the results or the publication of the study
	COLUMN pool_email text	# participant's email if they wish to be added to an experiment pool list for future experiments
 	
TABLE d_questions # table for demographics questions
	COLUMN name text NOT NULL PRIMARY KEY # demo question id e.g. 'd1', 'd2', 'd3', ...
	COLUMN question text NOT NULL # demo question asked to participants
	COLUMN field integer NOT NULL # 1 if a text field is needed so that participants can enter an `other` response not included in the given options, 0 otherwise

TABLE d_options # table for demographics questions' available response options (presented as radio buttons)
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN d_name text NOT NULL FOREIGN KEY (d_name) REFERENCES d_questions (name)
	COLUMN option text NOT NULL # response option e.g. '18-25', '26-40','41-65',...
	
TABLE problems # table for the three study problems
	COLUMN name text NOT NULL PRIMARY KEY # problem id e.g. 'p1', 'p2',...
	COLUMN desc text NOT NULL # description of problem context presented to participants
	COLUMN file text NOT NULL # file path of .npz file of sample-based inference for the problem 
	
TABLE t_questions # table for task questions
	COLUMN name text NOT NULL # task question id e.g. 't1', 't2', 't3', ..
	COLUMN problem text NOT NULL FOREIGN KEY (problem) REFERENCES problems (name)
	COLUMN question text NOT NULL # task question asked to participants
	COLUMN space text NOT NULL # inference space in {'prior','posterior'}. In this user study this was always set to 'prior'
	COLUMN mult_opts integer NOT NULL # 0 for radio buttons allowing single selection, 1 for check boxes allowing multiple selections as a response
	COLUMN show_widgets integer NOT NULL # 0 for hiding widget box for setting the indexing dimensions of data presented by the pair plot, 1 otherwise
	COLUMN show_graphs integer NOT NULL # 1 for showing graphical representations of available response options, 0 otherwise
	
TABLE t_graphs # table for tasks' response options' graph images
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN file text NOT NULL # file path of graph image
	
TABLE t_options # table for task questions' available response options (presented as either radio buttons or check boxes)
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN option text NOT NULL # response option e.g. 'a', 'b', 'c', or 'none'
	COLUMN iscorrect integer NOT NULL # 1 if this option is a correct answer for question t_name, 0 otherwise
	
TABLE t_params # table for parameters (random variables) of models that would be present in the pair plot in each task
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN param text NOT NULL # variable of model e.g. 'a', 'b' or 'temperature'
	
TABLE t_dims # table for indexing dimensions of data to be set in pair plots in each task
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN dim text NOT NULL # e.g. 'year' in Problem 1
	COLUMN value text NOT NULL # e.g. '2018' in Problem 1
	
TABLE d_answers # table for participants' answers to demographic questions
	COLUMN d_name text NOT NULL FOREIGN KEY (d_name) REFERENCES d_questions (name)
	COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name) 
	COLUMN option text NOT NULL # demographics response option selected by participant
	COLUMN text text # text inserted to text field if 'other' option was selected
	PRIMARY KEY (d_name, participant_id)

TABLE t_answers # table for participants' answers to task questions
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
	COLUMN response_time real NOT NULL # time the participant needed to respond to the question
	COLUMN confidence text NOT NULL # confidence level the participant reported for their response, on a scale of 0-5
	COLUMN w_interactions integer NOT NULL # number of interactions with widget box if present (only in task t19)
	COLUMN s_interactions integer # number of selection boxes drawn, if the participant was in the interactive group (IG)
	PRIMARY KEY (t_name, participant_id)
	
TABLE t_answers_selections # table for selection boxes drawn by participants in IG in study tasks
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
	COLUMN sel_order integer NOT NULL # incremental number 0,1,2,.. indicating the order of the selection box that participants drew for task t_name
	COLUMN var_name text NOT NULL # model's variable name that the participant interacted with
	COLUMN xmin real # min x coordinate of the selection box
	COLUMN xmax real # max x coordinate of the selection box
	
TABLE t_answers_opts # table for the response options selected by participants in task questions
	COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
	COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
	COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name) 
	COLUMN option text NOT NULL # task response option selected by participant
	
NOTE: The email addresses of the participants have been removed to ensure anonymization of the data.
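
As an illustration of how the tables above relate, the sketch below joins each selected option in `t_answers_opts` with its `iscorrect` flag in `t_options` and with the participant's `mode`. It assumes the file is an SQLite database (suggested by the `sqlite3` requirement) and does not use the repository's own helper scripts. The helper names `selected_options` and `build_toy_db` are hypothetical, and the toy rows are made up so that the example runs without the study data.

```python
import sqlite3

# Join every selected option with its correctness flag and the participant's
# condition. Rows with mode 'u' (duplicate registrations) are left out.
SELECTED_OPTIONS_SQL = """
SELECT p.mode, a.participant_id, a.t_name, a."option", o.iscorrect
FROM t_answers_opts AS a
JOIN t_options AS o
  ON o.t_name = a.t_name AND o."option" = a."option"
JOIN participants AS p
  ON p.name = a.participant_id
WHERE p.mode IN ('i', 's')
ORDER BY a.participant_id, a.t_name, a."option"
"""


def selected_options(conn):
    """Return (mode, participant_id, t_name, option, iscorrect) tuples."""
    return conn.execute(SELECTED_OPTIONS_SQL).fetchall()


def build_toy_db():
    """Create an in-memory database with made-up rows (NOT real study data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE participants (name integer PRIMARY KEY, mode text NOT NULL);
        CREATE TABLE t_options (id integer PRIMARY KEY, t_name text NOT NULL,
                                "option" text NOT NULL, iscorrect integer NOT NULL);
        CREATE TABLE t_answers_opts (id integer PRIMARY KEY, t_name text NOT NULL,
                                     participant_id integer NOT NULL,
                                     "option" text NOT NULL);
        INSERT INTO participants VALUES (1, 'i'), (2, 's'), (3, 'u');
        INSERT INTO t_options (t_name, "option", iscorrect)
            VALUES ('t1', 'a', 1), ('t1', 'b', 0);
        INSERT INTO t_answers_opts (t_name, participant_id, "option")
            VALUES ('t1', 1, 'a'), ('t1', 2, 'b'), ('t1', 3, 'a');
    """)
    return conn
```

With the real data, the same query can be run on a connection to `study_analysis/data/study_01.db`. For check-box questions (`mult_opts` = 1) a participant can contribute several rows per task, so per-question correctness needs an additional aggregation step that this sketch does not attempt.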

## File structure

The files are organized as follows:

        study_analysis/
            analysis.ipynb                  # Code for the preprocessing and statistical analysis of the data, including the code for all visualizations of the analysis results
            analysis.pdf                    # PDF export of the analysis.ipynb Jupyter notebook
            demographics_analysis.ipynb     # Code for the summary statistics and visualizations of the participants' demographics
            demographics_analysis.pdf       # PDF export of the demographics_analysis.ipynb Jupyter notebook
            data/                           # Files containing the collected data and the generated data for the user study questions
                study_01.db                         # SQL database containing the collected data
                min_temperature.npz                 # Inference data of probabilistic model of Problem 1 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/min_temperature)
                transformation.npz                  # Inference data of probabilistic model of Problem 2 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/random_number_generator)
                reaction_times_hierarchical.npz     # Inference data of probabilistic model of Problem 3 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/reaction_times)
            utils/                          # Scripts to retrieve data and generate models and visualizations for the analysis of the data
                db/
                    get_data_db.py          # Methods to retrieve data from database `study_analysis/data/study_01.db`
                    utils.py                # Methods to connect to the database
                information_retrieval.py    # Methods to prepare data for the analysis and visualization
                models.py                   # Bayesian models defined in PyMC3 and used for the statistical analysis of the data
                visualization.py            # Methods used to create the visualizations presenting the results of the analysis in the paper
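
The `.npz` files under `data/` are NumPy archives, which are zip files holding one `.npy` entry per saved array. The array names stored in the study files are not documented here, so the sketch below only shows one way to list them before loading anything. The helper name `npz_array_names` is hypothetical and is not part of the repository's scripts.

```python
import zipfile


def npz_array_names(path):
    """List the array names stored in a NumPy .npz archive without loading them."""
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


# Usage, assuming the working directory is the one containing study_analysis/:
#   names = npz_array_names("study_analysis/data/min_temperature.npz")
#   import numpy as np
#   with np.load("study_analysis/data/min_temperature.npz") as data:
#       first_array = data[names[0]]
```

If an archive contains object arrays, `np.load` refuses to read them unless `allow_pickle=True` is passed, which should only be done for files from a trusted source.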

## Running scripts and notebooks

### Requirements
Python and the following packages are required to re-run the analysis and recreate the visualizations of the results.
* python 3.6+
* numpy
* scipy
* pandas
* matplotlib
* seaborn
* pymc3
* arviz
* sqlite3 (included in the Python standard library)
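
A quick way to check the list above against the current environment is sketched below. The function name `missing_packages` is hypothetical. The check only tests whether each module can be located and does not verify package versions.

```python
from importlib.util import find_spec

# Import names of the packages listed in the requirements above.
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "seaborn",
            "pymc3", "arviz", "sqlite3"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Usage:
#   print(missing_packages())   # an empty list means everything was found
```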

### Re-running the analysis of the data
The `study_analysis/analysis.ipynb` notebook contains the code for the preprocessing and statistical analysis of the data. The visualizations of the analysis results presented in the paper can be recreated by the code in this notebook.

The `study_analysis/demographics_analysis.ipynb` notebook contains the code that generates the summary statistics of the participants' demographics. The visualizations of this analysis presented in the paper can be recreated by the code in this notebook.

### Connecting to and retrieving data from the database
The script `study_analysis/utils/db/utils.py` contains methods for connecting to the database, and the script `study_analysis/utils/db/get_data_db.py` contains methods for retrieving data from it.
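
The function signatures of these scripts are not listed here. As an independent alternative, the database can also be inspected directly with Python's built-in `sqlite3` module, assuming the file is an SQLite database. The sketch below is such a minimal example; `open_read_only` and `list_tables` are hypothetical helper names and are not the functions provided by the scripts above.

```python
import sqlite3


def open_read_only(path):
    """Open an SQLite file read-only so the collected data cannot be modified."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def list_tables(conn):
    """Return the names of all tables in the connected database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Usage, assuming the working directory is the one containing study_analysis/:
#   conn = open_read_only("study_analysis/data/study_01.db")
#   print(list_tables(conn))
#   conn.close()
```

Opening the file read-only is a precaution for exploratory queries, since an accidental `UPDATE` or `DELETE` then fails instead of altering the dataset.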