Experimenting with Ensembles of Pre-Trained Language Models for Classification of Custom Legal Datasets

Tamara Matthews and David Lillis

In Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022), pages 68–78, Trento, Italy, 2022. Association for Computational Linguistics.

Abstract

Document corpora owned by law and regulatory firms pose significant challenges for text classification: they are multi-labelled and highly imbalanced, and often have a relatively small number of instances and a large word count per instance. Deep learning ensemble methods can improve generalization and performance for multi-label text classification, but using pre-trained language models as base learners leads to high computational costs. To tackle the imbalance problem and improve generalization, we present a fast, pseudo-stratified sub-sampling method that we use to extract diverse data subsets. These subsets are used to create base models for deep ensembles, which are built from fine-tuned pre-trained transformers with moderate computational cost, such as BERT, RoBERTa, XLNet and ALBERT. A key feature of the sub-sampling method is that it preserves the characteristics of the entire dataset (particularly the label frequency distribution) while extracting subsets. The same method is also used to extract smaller custom datasets from the freely available LexGLUE legal text corpora. We discuss the approaches used and the classification performance obtained with deep learning ensembles, illustrating the effectiveness of our approach on these custom datasets.
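
The abstract does not spell out the sub-sampling procedure, so the following is only an illustrative sketch of one way a label-frequency-preserving sub-sample of a multi-label dataset could be drawn; it should not be read as the authors' algorithm. The function name `pseudo_stratified_subsample`, its parameters, and the rarest-label-first greedy strategy are all assumptions made for this example. Because an instance can carry several labels, picking instances for a common label may push a rarer co-occurring label above its target, so the label proportions are preserved only approximately.

```python
import random
from collections import Counter


def pseudo_stratified_subsample(label_sets, fraction, seed=0):
    """Hypothetical helper: return sorted indices of a subset whose
    per-label counts are roughly `fraction` of the full dataset's counts.

    label_sets: list of sets of string labels, one set per instance.
    """
    rng = random.Random(seed)
    full_counts = Counter(l for labels in label_sets for l in labels)
    selected = set()
    selected_counts = Counter()

    # Visit labels from rarest to most frequent so that scarce labels
    # are filled before their instances are used up by common labels.
    for label, count in sorted(full_counts.items(), key=lambda kv: (kv[1], kv[0])):
        target = max(1, round(fraction * count))  # keep at least one example
        need = target - selected_counts[label]
        if need <= 0:
            continue
        candidates = [
            i for i, labels in enumerate(label_sets)
            if label in labels and i not in selected
        ]
        rng.shuffle(candidates)
        for i in candidates[:need]:
            selected.add(i)
            # An instance contributes to every label it carries.
            selected_counts.update(label_sets[i])
    return sorted(selected)


# Toy imbalanced multi-label data (made up for illustration).
toy_labels = [{"a"}] * 60 + [{"a", "b"}] * 30 + [{"c"}] * 10
subset = pseudo_stratified_subsample(toy_labels, fraction=0.5, seed=1)
print(len(subset), Counter(l for i in subset for l in toy_labels[i]))
```

Under these assumptions, calling the function with different `seed` values yields different subsets with similar label proportions, which is the kind of diversity one would want when training each base model of an ensemble on its own subset.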