# Extracting chemical reactions from text using Snorkel

Authors: Emily K Mallory, Matthieu de Rochemonteix, Alex Ratner, Ambika Acharya, Chris Re, Roselie A Bright, Russ B Altman

Contact: rbaltman@stanford.edu

## Description

This document describes the supporting data and methods for 'Extracting chemical reactions from text using Snorkel'.

## Corpus

The file "Bacteria_Corpus_full.txt" contains the full list of PubMed identifiers for Bacteria_Corpus.

## Supporting notebooks

The following notebooks are provided:

* ChemRxn_evaluation.ipynb
* Result_analysis_FDE.ipynb
* Validation_FDAChems.ipynb

## General Setup

Most of these experiments require substantial computational power and have long runtimes, largely because we focus on larger databases to test the scalability of the experiments. Most of the experiments were run on a large compute cluster. While not a requirement, this is strongly advised. Similarly, setting up a PostgreSQL server is strongly advised, as it allows for more scalable databases that support parallelism.

## Chemical Reaction

### Small version: `metacyc_database`, aka `fda`

#### Database setup

Request access to the complete `metacyc_database`. Experiments were performed using `scuba`, a command-line interface for Snorkel. To run the experiment, copy the database to the `scuba` home folder, then run the experiment from step 1.

#### Running the experiment

The reported results were obtained using the following command:

```
python -u run.py --exp fda --verbose --lfs 1 --start_at 1 --disc_model_search_space 10 --recompute_feats
```

The `--lfs 1` option means that only the file `lfs_1.py` is used for the labeling functions.

### Extended version: `bacteria_database`, aka `fde`

#### Database setup

This experiment uses the Bacteria Corpus, which we built with the `snorkel-biocorpus` tool. Refer to that repository's documentation for usage instructions.
To build the database, we need a list of PMIDs, `Bacteria_Corpus_full.txt`. We advise the following procedure:

* Download a full raw text dump of the PubTator data (~40GB).
* Filter the raw dump using `filter_pubtator_file` to get the subset of interest.
* Run the `snorkel_biocorpus` tool on the reduced dump. This parses the selected abstracts and adds them to a database in the Snorkel format. We advise using a PostgreSQL database.

#### Running the experiment

We strongly advise running this experiment step by step, using the `--one_only` option, as it has some specificities due to its size. Furthermore, this is a 5-split experiment and should therefore be run with the following arguments:

* `--n_splits 5` for 5 splits
* `--dev_split 3`
* `--test_split 4`
* `--label_all_splits` to avoid having to recompute on splits 1 and 2 later

It may be necessary to run the candidate extraction in batches of documents, using the `--candidate_extraction_batch_size` option: batching is more stable and allows the extraction to resume at a given batch if a previous run failed, instead of re-running everything.

The execution is straightforward until step 5. To use Gibbs sampling for the generative model, run the regular step 5. This experiment uses extra labeling functions compared to the `fda` experiment.

### Using marginals from another generative model

First, export the labels corresponding to the labeling functions. This is done by running the `run_get_model_perf` script with the `--export_labels` argument, which exports the label matrix in sparse format to `.npz` files. Once the files are exported, run the generative model separately, then dump the marginals to the `.npy` format with `allow_pickle=False`. To use them in the pipeline, run `run.py` with the argument `--marginals_path`, where the marginals folder contains one file per split named `marginals_.npy` with the marginals.
### Final result exportation

Result exportation is done using the `--export_pred` flag of `run.py`. This dumps a tsv file with the candidates, the marginals from the discriminative model, the predictions, and relevant information to match the candidates with other databases. This process may take long. The `--recompute_feats` flag should be off.

### Error analysis and plots

Several options are available for error analysis, performance analysis, and debugging:

* The `custom_error_analysis` flag for `run.py` exports prediction results on the dev set, intended for visualization.
* Running models with the config argument `allchecks` set to `True` checkpoints the discriminative model at each epoch during training. This is the simplest way to generate learning curves afterwards (despite being costly in terms of time and storage). The config argument `prefix` should also be set in this case, as it identifies the dumps.

The script `run_get_model_perf.py` also provides a variety of tools to analyze the models, among them:

* Exportation of the label matrix using `--export_labels`.
* Exportation of ROC curves, PR curves, and histograms of marginals for all categories, for all splits except 0 and 1, using the `--export_type` argument, which is either `disc` or `gen`. This plots from the results of the gridsearch; `--n_models` determines the number of models to consider. This dumps `.png` files to the `plots` folder.
* Manual selection of a model from the gridsearch. Currently implemented for `SparseLogisticRegression` and `GenerativeModel` only: use `--export_type` to set the model type to export (`disc` or `gen`), and use `model_selector` to select which model to load as the selected model.
* Exportation of marginals. Use the `exp_best_margs` argument to export the marginals of the discriminative and generative models, together with the gold labels, for the labeled splits, for the best model, into a csv file: `exportation_marginals_df.csv`.

### Building the curated drug lists

## Notebooks

#### `ChemRxn_evaluation.ipynb`

This notebook performs basic exploration of the database built for the `fde` experiment. Basic database-building tasks can also be performed, such as labeling candidates, defining new splits, or exporting them. This notebook contains results from Table 7.

#### `Result_analysis_FDE`

This notebook performs analysis of results on the FDE experiment. This notebook is deprecated; the preferred method is to export the results using the `run_get_model_perf` script with the `exp_best_margs` flag.

#### `Validation_FDAChems`

This notebook performs validation on an extra subset for the FDA experiment. This notebook contains results from Table 7.
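From a marginals export like the one produced by `exp_best_margs`, dev-set metrics can be recomputed directly. A minimal sketch, assuming marginals are probabilities in [0, 1] and gold labels are coded ±1 (this coding and the 0.5 threshold are assumptions for illustration, not the repo's fixed schema):

```python
def precision_recall(marginals, gold, thresh=0.5):
    """Threshold marginal probabilities into ±1 predictions and
    compute precision/recall against ±1 gold labels.

    (Illustrative helper; marginal/label coding is an assumption.)
    """
    preds = [1 if m > thresh else -1 for m in marginals]
    tp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == -1)
    fn = sum(1 for p, g in zip(preds, gold) if p == -1 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `thresh` over a grid of values yields the points of a PR curve like those exported by `run_get_model_perf.py`.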
## Appendix : Reference configs

### MetaCyc_Corpus, aka FDA

```python
config = {
    'disc-model-class': SparseLogisticRegression,
    'featurizer-class': FeatureAnnotator,
    'featurizer-init-params': dict(),
    'disc-init-params': {
        'n_threads': 4,
        'seed': 123
    },
    'disc-params-default': {
        'n_epochs': 750,
        'rebalance': 0.1,
        'print_freq': 50,
        'batch_size': 64,
        'beta': 0.9,
        'lr': 2e-4,
        'l1_penalty': 2e-4,
        'l2_penalty': 1e-4,
        'allchecks': False
    },
    'disc-eval-batch-size': 32,
    'deps-thresh': 0.01,
    'gen-init-params': {
        'lf_propensity': True,
        'seed': 123
    },
    'disc-params-range': {
        'lr': [2e-4, 1e-5, 1e-4, 5e-4, 5e-5],
        'l1_penalty': [1e-3, 1e-4, 1e-5],
        'l2_penalty': [1e-2, 1e-3, 1e-4],
        'rebalance': [0.1, 0.2, 0.05, 0.4]
    }
}
```

### Bacteria_Corpus, aka FDE

```python
config = {
    'disc-model-class': SparseLogisticRegression,
    'featurizer-class': FeatureAnnotator,
    'featurizer-init-params': dict(),
    'disc-init-params': {
        'n_threads': 4,
        'seed': 123
    },
    'disc-params-default': {
        'n_epochs': 40,
        'rebalance': 0.1,
        'print_freq': 10,
        'batch_size': 64,
        'allchecks': False,
        'lr': 1e-3,
        'l1_penalty': 1e-4,
        'l2_penalty': 1e-3,
        'label_rebalancing_threshold': 0.4
    },
    'disc-eval-batch-size': 32,
    'deps-thresh': 0.01,
    'gen-init-params': {
        'lf_propensity': True,
        'seed': 123
    },
    'gen-params-range': {
        'step_size': [1e-2, 1e-3, 1e-4, 1e-5],
        'reg_param': [0.05, 0.1, 0.25, 0.5],
        'LF_acc_prior_weight_default': [0.5, 1.0, 1.5],
        'decay': [0.95, 0.98],
        'epochs': [50, 75],
        'threads': [2]
    },
    'disc-params-range': {
        'dim': [50],
        'dropout': [0.25],
        'lr': [1e-3, 1e-4, 1e-5],
        'l1_penalty': [1e-4, 1e-6],
        'l2_penalty': [1e-3, 1e-4, 1e-5],
        'rebalance': [0.1, 0.25],
        'label_rebalancing_threshold': [0.4, 0.5]
    }
}
```
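The `*-params-range` entries above define the hyperparameter gridsearch: each candidate model takes the defaults and overrides them with one point from the Cartesian product of the ranges. As an illustration only (this is not the repo's actual search code), the grid for the FDA `disc-params-range` can be enumerated like this:

```python
from itertools import product

def expand_grid(defaults, param_ranges):
    """Yield one full config dict per point in the hyperparameter grid,
    overriding the defaults with the grid values. (Illustrative sketch.)"""
    names = sorted(param_ranges)
    for values in product(*(param_ranges[n] for n in names)):
        cfg = dict(defaults)
        cfg.update(zip(names, values))
        yield cfg

# The FDA disc-params-range from the config above.
disc_params_range = {
    'lr': [2e-4, 1e-5, 1e-4, 5e-4, 5e-5],
    'l1_penalty': [1e-3, 1e-4, 1e-5],
    'l2_penalty': [1e-2, 1e-3, 1e-4],
    'rebalance': [0.1, 0.2, 0.05, 0.4],
}
grid = list(expand_grid({'n_epochs': 750}, disc_params_range))
# 5 * 3 * 3 * 4 = 180 candidate configurations
```

A cap such as `--disc_model_search_space 10` presumably restricts the search to a subset of this grid rather than evaluating all of it.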