# Dataset

This comprises the dataset and analysis code used for the paper "Does Interacting Help Users Better Understand the Structure of Probabilistic Models?", Evdoxia Taka, Sebastian Stein, John H. Williamson.

This is a collection of data and code from a user study evaluating the effect of interactive visualizations on users' understanding of the structure of probabilistic models. The dataset consists of participants' responses, demographics and interaction logs, collected while they used either a static or an interactive version of pair plot visualizations.

## Collected Data

The data from the user study is held in `study_analysis/data/study_01.db`, an SQLite database. The database contains the questions of the study, the responses of participants, their demographics, and the interaction logs of the interactive group (IG).

The script `study_analysis/utils/db/utils.py` provides methods for connecting to the database. The script `study_analysis/utils/db/get_data_db.py` provides methods for retrieving the data from the database.

The following schema describes the tables and columns of the database.

    TABLE participants # table for participants
        COLUMN name integer NOT NULL PRIMARY KEY # participant id - random integer between 0 and 200
        COLUMN mode text NOT NULL # visualization condition in {'i','s','u'}, where 'i': interactive, 's': static, and 'u' marks multiple registrations made mistakenly by the same participant
        COLUMN status text NOT NULL # user study status of participant e.g. 'training_videos', 'task', 'demo', 'end', 'end_thanks'
        COLUMN begin_timestamp text # timestamp when participant started the study
        COLUMN end_timestamp text # timestamp when participant finished the study
        COLUMN results_email text # participant's email to receive results or publication of study
        COLUMN pool_email text # participant's email if they wish to be added to an experiment pool list for future experiments

    TABLE d_questions # table for demographics questions
        COLUMN name text NOT NULL PRIMARY KEY # demo question id e.g. 'd1', 'd2', 'd3', ...
        COLUMN question text NOT NULL # demo question asked to participants
        COLUMN field integer NOT NULL # 0 or 1, indicating whether a text field is needed so that participants can enter an `other` option not included in the given options

    TABLE d_options # table for demographics questions' available response options (presented as radio buttons)
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN d_name text NOT NULL FOREIGN KEY (d_name) REFERENCES d_questions (name)
        COLUMN option text NOT NULL # response option e.g. '18-25', '26-40', '41-65', ...

    TABLE problems # table for the three study problems
        COLUMN name text NOT NULL PRIMARY KEY # problem id e.g. 'p1', 'p2', ...
        COLUMN desc text NOT NULL # description of problem context presented to participants
        COLUMN file text NOT NULL # file path of .npz file of sample-based inference for the problem

    TABLE t_questions # table for task questions
        COLUMN name text NOT NULL # task question id e.g. 't1', 't2', 't3', ...
        COLUMN problem text NOT NULL FOREIGN KEY (problem) REFERENCES problems (name)
        COLUMN question text NOT NULL # task question asked to participants
        COLUMN space text NOT NULL # inference space in {'prior','posterior'}. In this user study this was always set to 'prior'
        COLUMN mult_opts integer NOT NULL # 0 for radio buttons allowing a single selection, 1 for check boxes allowing multiple selections as a response
        COLUMN show_widgets integer NOT NULL # 0 for hiding the widget box for setting the indexing dimensions of data presented by the pair plot, 1 otherwise
        COLUMN show_graphs integer NOT NULL # 1 for showing graphical representations of available response options, 0 otherwise

    TABLE t_graphs # table for tasks' response options' graph images
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN file text NOT NULL # file path of graph image

    TABLE t_options # table for task questions' available response options (presented as either radio buttons or check boxes)
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN option text NOT NULL # response option e.g. 'a', 'b', 'c', or 'none'
        COLUMN iscorrect integer NOT NULL # 1 if this is a correct answer for question t_name, 0 otherwise

    TABLE t_params # table for parameters (random variables) of models that would be present in the pair plot in each task
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN param text NOT NULL # variable of model e.g. 'a', 'b' or 'temperature'

    TABLE t_dims # table for indexing dimensions of data to be set in pair plots in each task
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN dim text NOT NULL # e.g. 'year' in Problem 1
        COLUMN value text NOT NULL # e.g. '2018' in Problem 1

    TABLE d_answers # table for participants' answers to demographic questions
        COLUMN d_name text NOT NULL FOREIGN KEY (d_name) REFERENCES d_questions (name)
        COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
        COLUMN option text NOT NULL # demographics response option selected by participant
        COLUMN text text # text entered in the text field if the 'other' option was selected
        PRIMARY KEY (d_name, participant_id)

    TABLE t_answers # table for participants' answers to task questions
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
        COLUMN response_time real NOT NULL # time the participant needed to give their response to the question
        COLUMN confidence text NOT NULL # participant's self-reported confidence in their response, on a 0-5 scale
        COLUMN w_interactions integer NOT NULL # number of interactions with the widget box if present (only in task t19)
        COLUMN s_interactions integer # number of selection boxes drawn if participant in IG
        PRIMARY KEY (t_name, participant_id)

    TABLE t_answers_selections # table for selection boxes drawn by participants in IG in study tasks
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
        COLUMN sel_order integer NOT NULL # incremental number 0, 1, 2, ... indicating the order of the selection box that the participant drew for task t_name
        COLUMN var_name text NOT NULL # model's variable name that the participant interacted with
        COLUMN xmin real # min x coordinate of the selection box
        COLUMN xmax real # max x coordinate of the selection box

    TABLE t_answers_opts # table for the response options selected by participants in task questions
        COLUMN id integer PRIMARY KEY # incremental id e.g. 0, 1, 2, ...
        COLUMN t_name text NOT NULL FOREIGN KEY (t_name) REFERENCES t_questions (name)
        COLUMN participant_id integer NOT NULL FOREIGN KEY (participant_id) REFERENCES participants (name)
        COLUMN option text NOT NULL # task response option selected by participant

NOTE: The email addresses of the participants have been removed to ensure anonymization of the data.

## File structure

    study_analysis/
        analysis.ipynb # Code for the preprocessing and statistical analysis of the data. The code for all visualizations of the analysis results is also present here.
        analysis.pdf # Export of analysis.ipynb jupyter notebook in .pdf
        demographics_analysis.ipynb # Code for the participants' demographics summary statistics and visualizations
        demographics_analysis.pdf # Export of demographics_analysis.ipynb jupyter notebook in .pdf
        data/ # Files containing the collected data and generated data for the user study questions
            study_01.db # sql database containing the collected data
            min_temperature.npz # Inference data of probabilistic model of Problem 1 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/min_temperature)
            transformation.npz # Inference data of probabilistic model of Problem 2 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/random_number_generator)
            reaction_times_hierarchical.npz # Inference data of probabilistic model of Problem 3 (code for model can be found in https://github.com/evdoxiataka/ipme/tree/master/examples/user_study/reaction_times)
        utils/ # Scripts to retrieve data and generate models and visualizations for the analysis of the data
            db/
                get_data_db.py # Methods to retrieve data from database `study_analysis/data/study_01.db`
                utils.py # Methods to connect to the database
            information_retrieval.py # Methods to prepare data for the analysis and visualization
            models.py # Bayesian models defined in PyMC3 and used for the statistical analysis of the data
            visualization.py # Methods used to create the visualizations presenting the results of the analysis in the paper

## Running scripts and notebooks

### Requirements

Python and the following packages are required to re-run the analysis and recreate the visualizations of the results.

* python 3.6+
* numpy
* scipy
* pandas
* matplotlib
* seaborn
* pymc3
* arviz
* sqlite3

### Re-running analysis of data

The `study_analysis/analysis.ipynb` notebook contains the code for the preprocessing and statistical analysis of the data. The visualizations of the analysis results presented in the paper can be recreated with the code in this notebook.

The `study_analysis/demographics_analysis.ipynb` notebook contains the code for generating the summary statistics of the participants' demographics. The visualizations of this analysis presented in the paper can be recreated with the code in this notebook.

### Connecting and retrieving data from the database

The script `study_analysis/utils/db/utils.py` contains methods for connecting to the database, and the script `study_analysis/utils/db/get_data_db.py` contains methods for retrieving data from it.
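
For readers who want to inspect the database without the provided scripts, the following is a minimal sketch that uses only Python's built-in `sqlite3` module and the table and column names from the schema above. The function names `participants_per_mode` and `selected_options` are hypothetical and are not taken from `get_data_db.py`. The join also assumes that `t_answers_opts.option` stores the same label as `t_options.option`.

```python
import sqlite3


def participants_per_mode(conn):
    """Count participants per visualization condition ('i', 's' or 'u').

    Hypothetical helper for illustration; not part of get_data_db.py.
    """
    rows = conn.execute(
        "SELECT mode, COUNT(*) FROM participants GROUP BY mode ORDER BY mode"
    ).fetchall()
    return {mode: count for mode, count in rows}


def selected_options(conn, t_name):
    """Return (participant_id, option, iscorrect) tuples for one task question.

    Assumption: the option label stored in t_answers_opts matches the label
    stored in t_options for the same t_name, so the two tables can be joined.
    """
    query = """
        SELECT a.participant_id, a."option", o.iscorrect
        FROM t_answers_opts AS a
        JOIN t_options AS o
          ON o.t_name = a.t_name AND o."option" = a."option"
        WHERE a.t_name = ?
        ORDER BY a.participant_id, a."option"
    """
    return [tuple(row) for row in conn.execute(query, (t_name,))]


# Example usage against the study database (path taken from this README):
#   conn = sqlite3.connect("study_analysis/data/study_01.db")
#   print(participants_per_mode(conn))
#   print(selected_options(conn, "t1"))
#   conn.close()
```

Passing the connection into each function keeps the queries independent of how the connection is opened, so the same helpers can be tried against a small in-memory copy of the schema before being run on the full database.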