![alt Decision aware multi-encoder transformer](https://github.com/anonymous12-lab/decison_aware_multi_source/blob/main/model.png)
<h3 align="center"> Decision aware multi-encoder transformer</h3>

### Reviews
1. This paper proposes a method for transfer learning, i.e. leveraging a network trained on some original task A in learning a new task B, which not only improves performance on the new task B, but also tries to avoid degradation in performance on A. The general idea is based on encouraging a model trained on A, while training on the new task B, to match fake targets produced by the model itself when it is trained only on the original task A. Experiments show that this method can help in improving the result on task B, and is better than other baselines, including standard fine-tuning.

   General comments/questions:
   - As far as I can tell, there is no experimental result supporting the claim that your model still performs well on the original task. All experiments show that you can improve on the new task only.
   - The introduction makes a strong statement about distilling a logical rule engine into a neural network, which I find a bit misleading. The approach in the paper is not specific to transferring from logical rules (as stated in Sec. 2) and simply relies on the rule engine to provide labels for unlabelled data.
   - One of the obvious baselines to compare with your approach is standard multi-task learning on both tasks A and B together. That is, you train the model from scratch on both tasks simultaneously (while sharing parameters). It is not clear whether this is the same as what is referred to in Sec. 8 as "joint training". Can you please explain more clearly what you refer to as joint training?
   - Why can't we find the same baselines in both Table 2 and Table 3? For example, Table 2 is missing "joint training", and Table 3 is missing the GRU trained on the target task.
   - While the idea is presented as a general method for transfer learning, experiments are focused on one domain (sentiment analysis on a SemEval task). I think that either the experiments should include applying the idea to at least one other domain, or the writing of the paper should be modified to make the focus more specific to this domain/task.

   Writing comments:
   - The writing of the paper in general needs some improvement, but more specifically in the experiment section, where experiment setting and baselines should be explained more concisely.
   - The ensemble methodology paragraph does not fit the flow of the paper. I would rather explain it in the experiments section than include it as part of your approach.
   - Table 1 seems to report cross-validation results, and I do not think it is very informative to the general reader.
2. This paper proposes a regularization technique for neural network training that relies on having multiple related tasks or datasets in a transfer learning setting. The proposed technique is straightforward to describe and can also leverage external labeling systems, perhaps based on logical rules. The paper is clearly written and the experiments seem relatively thorough.

   Overall this is a nice paper but does not fully address how robust the proposed technique is. For each experiment there seems to be a slightly different application of the proposed technique, or a lot of ensembling and cross validation. I can't figure out if this is because the proposed technique does not work well in general and thus required a lot of fiddling to get right in experiments, or if this is simply an artifact of ad-hoc experiments to try and get the best performance overall. If more datasets, or a direct discussion of this issue, were able to show the strengths and limitations of the proposed technique more clearly, this could be a great paper.

   Overall the proposed method seems nice and possibly useful for other problems. However, in the details of logical rule distillation and various experiment settings it seems like there is a lot of running the model many times or selecting a particular way of reusing the models and data, which makes me wonder how robust the technique is or whether it requires a lot of trying various approaches, ensembling, or picking the best model from cross validation to show real gains. The authors could help by discussing this explicitly for all experiments in one place rather than listing the various choices / approaches in each experiment. As an example, these sorts of phrases make me very unsure how reliable the method is in practice versus how much the authors had to engineer this regularizer to perform well: "We noticed that equation 8 is actually prone to overfitting away from a good solution on the test set although it often finds a pretty good one early in training."

   The introduction section should first review the definitions of transfer learning vs multi-task learning to make the discussion clearer. It also needs to justify why catastrophic forgetting is actually a problem. If the final target task is the only thing of interest then forgetting the source task is not an issue, and the authors should motivate why forgetting matters in their setting. This paper explores sequential transfer, so it's not obvious why forgetting the source task matters.

   Section 7 introduces the logical rules engine in a fairly specific context. Rather, it would be good to state more generally what this system entails to help people figure out how this method would apply to other problems.
3. This paper introduces a new method for transfer learning that avoids the catastrophic forgetting problem. It also describes an ensembling strategy for combining models that were learned using transfer learning from different sources. It puts all of this together in the context of recurrent neural networks for text analytics problems, to achieve new state-of-the-art results for a subtask of the SemEval 2016 competition. As the paper acknowledges, 1.5% improvement over the state-of-the-art is somewhat disappointing considering that it uses an ensemble of 5 quite different networks.

   These are interesting contributions, but due to the many pieces, unfortunately, the paper does not seem to have a clear focus. From the title and abstract/conclusion I would've expected a focus on the transfer learning problem. However, the description of the authors' approach is merely a page, and its evaluation is only another page. In order to show that this idea is a new methodological advance, it would've been good to show that it also works in at least one other application (e.g., just some multi-task supervised learning problem). Rather, the paper takes a quite domain-specific approach and discusses the pieces the authors used to obtain state-of-the-art performance for one problem. That is OK, but I would've rather expected that from a paper called something like "Improved knowledge transfer and distillation for text analytics". If accepted, I encourage the authors to change the title to something along those lines.

   The many pieces also made it hard for me to follow the authors' train of thought. I'm sure the authors had a good reason for their section ordering, but I didn't see the red thread in it. How about re-organizing the sections as follows to discuss one contribution at a time? 1,2,4,3,8 including 6, put 9 into an appendix and point to it from here, 7, 5, 10. That would first discuss the transfer learning piece (4, and experiments potentially in a subsection with previous sections 3,8,6), then discuss the distillation of logical rules (7), and then discuss ensembling and experiments for it (5 and 10). One clue that the current structure is suboptimal is that there are 11 sections...

   I like the authors' idea for transfer learning without catastrophic forgetting, and I must admit I would've rather liked to read a paper solely about that (studying where it works, and where it fails) than about the many other topics of the paper. I weakly vote for acceptance since I like the ideas, but if the paper does not make it in, I would suggest that the authors consider splitting it into two papers, each of which could hopefully be more focused.

### Meta-review
The proposed approach is not consistently applied for the different experiments; this significantly harms the overall value of the research. The results are also quite domain-specific, and it is not clear if the findings would hold more generally. The paper is not clearly organised or written and does not give a specific enough introduction to the field of transfer learning.

## Train
python main_train_eval.py --model_name_or_path /path --do_train True --do_eval False --task summarization --train_file peer_data/train.csv --validation_file peer_data/validation.csv --output_dir experiment_decsion_at_decoder --decision_label decision --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --summary_column metareview --do_predict False
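
The command above is long enough that flag typos are easy to make. As a convenience, the following is a minimal sketch of a helper that assembles the same argument list in Python. The names `TRAIN_FLAGS`, `SWITCHES`, and `build_command` are hypothetical and not part of this repository. It also assumes the script accepts `--flag value` as well as `--flag=value`, which has not been checked against `main_train_eval.py`.

```python
import shlex

# Flag values copied from the training command above. "/path" is the
# placeholder from the original command, not a real model location.
TRAIN_FLAGS = {
    "--model_name_or_path": "/path",
    "--do_train": "True",
    "--do_eval": "False",
    "--do_predict": "False",
    "--task": "summarization",
    "--train_file": "peer_data/train.csv",
    "--validation_file": "peer_data/validation.csv",
    "--output_dir": "experiment_decsion_at_decoder",
    "--decision_label": "decision",
    "--summary_column": "metareview",
    "--per_device_train_batch_size": "1",
    "--per_device_eval_batch_size": "1",
}

# Flags that appear without a value in the original command.
SWITCHES = ["--overwrite_output_dir", "--predict_with_generate"]


def build_command(overrides=None):
    """Return the argument list, with optional per-flag value overrides."""
    flags = {**TRAIN_FLAGS, **(overrides or {})}
    argv = ["python", "main_train_eval.py"]
    for name, value in flags.items():
        argv += [name, value]
    argv += SWITCHES
    return argv


if __name__ == "__main__":
    # Print a shell-ready command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command()))
```

For example, `build_command({"--do_train": "False", "--do_eval": "True"})` would produce an evaluation-only variant, assuming the script honors those two flags the way their names suggest.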

## Link 
https://github.com/anonymous12-lab/decison_aware_multi_source
