Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low resource languages such as Hindi. Additionally one of the most common and time sensitive cases handled by the courts are
bail cases.
In this Thesis, we first introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. We then introduce and tackle the task of bail prediction. We select the bail cases from our HLDC corpus and further extract the facts and arguments and judge’s summary. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. Our MTL model uses summarization as an auxiliary task along with bail prediction as the main task. The intermediate summarisation step is a novel introduction which serves dual purposes.
First, it reduces the document size without compromising on the information. Since many transformer models have constraint on the input length, sending a summarised version of the documents allows us to overcome this barrier. Second, it builds towards explainable legal NLP systems as it allows us to identify salient sentences.
This Thesis lays the foundation for research in Legal NLP for Hindi court documents. The multitude of legal NLP tasks and challenges are indicative of the need for further research in this area.
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Justice Delayed is Justice Denied: Enabling Legal Artificial Intelligence via Bail Prediction on Hindi Case Documents
1. Justice Delayed is Justice Denied
Enabling Legal Artificial Intelligence via Bail
Prediction on Hindi Case Documents
Arnav Kapoor (20171040)
Advisor - Prof. Ponnurangam Kumaraguru
Examiners
Prof. Srijoni Sen
Prof. Saptarshi Ghosh
3. Enabling Legal Artificial Intelligence via Bail
Prediction on Hindi Case Documents
Legal Artificial Intelligence
Legal Artificial Intelligence (LegalAI) is concerned with the use of artificial
intelligence technologies particularly NLP, to legal task.
Bail Prediction
An automated system to predict if the Bail (judicial interim release of a person
suspected of a crime) would be granted or denied.
Hindi Case Documents
First work to focus on Hindi Legal documents
3
4. Motivation
The legal system plays a critical role in the functioning of a nation. It is paramount
that cases are handled judiciously and efficiently
In recent times, the legal system in many populous countries (e.g., India) has been
inundated with a large number of pending cases.
There is an imminent need to build systems to process legal documents and help
augment the legal procedure
4
5. - How many such pending cases are there and where are the majority of such cases
concentrated ?
- According to India’s National Judicial Data Grid, as of November 2021, there are
approximately 40 million cases pending in District Courts (National Judicial Data
Grid, 2021) as opposed to 5 million cases (1/8th) pending in High Courts.
- Need for developing models that could address the problems at the grass-root levels
of the Indian legal system.
‘Courts inundated with a large number of pending cases’
5
High Courts
District Courts
6. Judicial Structure in India
District Courts
District Court
High Courts
High Courts
SC
Supreme Court
- Similar to a few other countries like US and UK, India has adopted a tiered court-
system to improve productivity and streamline the judicial process.
6
7. District Court Challenges
Language - The case documents in District court use the local language of the
state, as opposed to high courts and supreme courts that consistently use English.
‘Article 348(1) of the Constitution of India provides that all proceedings in the
Supreme Court and in every High Court shall be in English language until
Parliament by law otherwise provides.’
Structure - Since each district has their own conventions and quality control check,
there is no consistency. Additionally district case documents are highly
unstructured and noisy (spelling and grammar mistakes since these are typed)
7
8. Prominent Legal Corpus
[Chalkidis et al., 2019] have developed an English corpus of European Court of
Justice documents. (Non-Indian context, English)
[Xiao et al., 2018] have developed Chinese Legal Document corpus. (Non-Indian
context, Chinese)
[Malik et al., 2021b] have developed an English corpus of Indian Supreme Court
documents. (Indian context, English)
To the best of our knowledge, there does not exist any legal document corpus for the
Hindi language 8
9. Contribution #1 - HLDC - Hindi Legal Document Corpus
About 40 million pending cases in the district courts of India. Out of the 40 million
pending cases, approximately 20 million are from courts where the official language is
Hindi.
No legal corpora exists for Hindi – HLDC (Hindi Legal Document Corpus)
Hindi Legal Documents Corpus (HLDC) is a corpus of 912,568 Indian legal case
documents in the Hindi language. The corpus is created by crawling data from the e-
Courts website , a GoI initiative to file, track and analyse case progress and documents
online.
10. HLDC - Corpus Pipeline
OCR
Document Segmentation (Header / Body )
Final Cleaning
10
12. Anonymisation Challenge
Anonymisation is a critical step of document processing. It has a two-fold purpose.
- Unbiased Models
- Privacy Concerns
We used a gazetteer along with regex-based rules for NER to anonymize the data.
List of first names, last names, middle names, locations, titles like पंिडत (Pandit: title of
Priest), सरजी (Sir), month names and day names were normalized to नाम (Naam:
<name>). Phone numbers were detected using regex patterns and replaced with a
<फ़ोन नंबर> (<phone-number>) tag, numbers written in both English and Hindi were
considered.
13
13. Standardisation Challenge - Case Type Identification
HLDC documents were processed to obtain different case types (e.g., Bail applications,
Criminal Cases)
However for the case documents, the uniformity for case type across districts and even
within district is little.
Bail applications was categorised as ‘Bail Appl’ or ‘Bail Application’ in Kanpur Nagar
district. Similarly, there were variations of ‘Civil Revision’ as ‘Civil Reivision’ and ‘Civill
Revision’ in multiple districts.
Resolved using a combination of manual verification along with edit distance metrics
14
15. Case Type Statistics
16
Case Type % in HLDC
Bail Applications 31.71
Criminal Cases 20.41
Original Suits 6.54
Warrants or Summons in Criminal Cases 5.24
Civil Misc 3.4
Final Report 3.32
Civil Cases 3.23
16. Contribution #2 - Bail Prediction
Showcase utility of the HLDC corpus on a relevant real world task.
The most prevalent case-type amongst the cases in the corpus was Bail cases.
Additionally bail cases are very time-sensitive in nature as the hearing and decision
needs to be completed in a short period. Bail cases on average are resolved within 7-
15 days as compared to most other cases which can takes years.
17
17. Bail Prediction
Bail Prediction Task
Input – Facts of the case
System – Bail Prediction ML Model
Output – Bail Granted || Bail Denied
Input
Facts and Arguments
of the case
System
Bail prediction model
Output
Bail is granted or
denied
18
19. Outcome Identification
Binary - Identify if bail is granted or denied
To extract the bail decision we manually identified keywords in result section.
Keywords like ख़ािरज (dismissed) and िनरस्त (invalidated/ cancelled) show rejection of
bail application and words like स्वीकार (accepted) identified acceptance of bail
application.
20
21. Judge’s Opinion and Facts Extraction
Judge’s Opinion Extraction - Most of the bail documents have a concluding
paragraph where the judge summarizes their viewpoints of the case. To
extract this, we first constructed certain regex which often precedes judge’s
summary.
Facts Extraction - The remaining case document after removal of judge’s
summary, final result paragraph and initial header information formed the
core facts of the case.
23. Formalised Problem Statement
Formally, consider a corpus of bail documents D = b1, b2, · · · , bi, where each bail
document is segmented as bi = (hi, fi, ji, yi).
Here, hi, fi, ji and yi represent the header, facts, judge’s summary and bail decision of
the document respectively. Additionally, the facts of every document contain k
sentences, more formally, fi = (si
1 , si
2, · · · , si
k), where si
k represents the kth
sentence of the ith bail document. We formulate the bail prediction task as a binary
classification problem.
We are interested in modelling pθ(yi|fi), which is the probability of the outcome yi
given the facts of a case fi. Here, yi ∈ {0, 1}, i.e., 0 if bail is denied or 1 if bail is
granted.
24
24. Models
We explore a multitude of models to solve the bail prediction problem. The proposed
Multi-Task Learning model performs the best. The different categories of models we
progressively explored were as follows :-
- Embedding Based Models
- Summarisation Based Models
- Multi Task Learning Model
25
25. Embedding Based Models
Doc2Vec
Classical embedding based model. Doc2Vec embeddings, in our case, is trained on the
train set of our corpus. The embeddings go as input to SVM and XGBoost classifiers.
IndicBERT
IndicBERT is a transformer language model trained on 12 major Indian languages.
However, IndicBERT, akin to other transformer LMs, has a limitation on the input’s
length (max 512 tokens). Experimented with 2 settings.
The first 512 tokens of the document.
The last 512 tokens of the document.
26
26. Summarisation Based Models
The models described in the previous section can lose important input signals due to
truncation which can be detrimental to the bail prediction task
Lengthy Documents —> Summarisation
Two fold benefits
a) Extractive summary of document as input i.e reducing the length of the
document, without losing important signals.
b) Summarization helps to evaluate the salient sentences in a legal document
and is a step towards developing explainable models 27
27. Summarisation Methods
Unsupervised - No need for ground truth summaries.
a) TF-IDF+IndicBERT: We rely on the statistical measures of term frequency
and inverse document frequency to recognise the most salient words and
sentences.
b) TextRank+IndicBERT: We use the graph based method TextRank to extract
the most important sentences from the documents
Semi-Supervised - Use the judge’s summary as a truth indicator to train the model.
- Salience Classifier + IndicBERT
28
28. Semi-Supervised Summarisation - Saliency Classifier
- Saliency Classifier - A binary sentence level classifier to know if the sentence is
important or not.
- We compute embeddings (using a multilingual sentence encoder) for each sentence in
the facts (source text) and the summarised judgement (silver standard), and assign a
saliency score to each sentence in the facts using cosine similarity.
- Formally – Salience(si
k ) = cos_sim( ji , si
k)
- The cosine similarities provides ranked list of sentences - top 50% salient, bottom half
as non-salient. We use this data to train the saliency classifier.
- The salient sentences are used to train (and fine-tune) IndicBERT based classifier.
29
29. Multi-Task Learning Model
Summarization based models show improvement in results. Instead of a 2 step -
summarisation + prediction, a multi task framework.
Bail prediction is the main task and sentence salience classification is the
auxiliary task.
31. Experiments and Results
We evaluate the models in two settings
a) All-district performance - 70:10:20 - train, validation and test split. Documents from all
districts used together.
b) District-wise performance (measure generalizability): Training on documents from 44
districts. Testing is done on a different set of 17 districts. The validation set has another set
of 10 districts. The number of documents in each category correspond to an approx
70:10:20 ratio.
All models are evaluated using standard accuracy and F1-score metric
32
34. Results
The performance of models is lower in the case of district-wise settings.
Possible Reason - Lexical variation across districts, difficult for the model to
generalize.
In general, summarization based models outperform better than Doc2Vec and
transformer-based models, highlighting the importance of the summarization step.
The multi-task model outperforms all the baselines in the district-wise setting with
78.53% accuracy. The auxiliary task of sentence salience classification helps learn
robust features during training.
However, in the case of an all-district split, the MTL model fails to beat simpler
baselines like TF-IDF+IndicBERT. The sentence salience training data based on
the cosine similarity heuristic, induce some noise for the auxiliary task.
35
35. Error Analysis
Kernel Density Estimate
(KDE) plots of our proposed
bail prediction model. The
majority of errors (incorrectly
dismissed / granted) are
borderline cases with model
output score around 0.5
36
36. Challenges and Limitations
Domain Specific Knowledge - Court documents contain very specific lexicon. Pre-
trained models built on other corpus don’t perform well.
Low Resource Language - Hindi, despite being the third most spoken language in
the world remains a low resource language in the context of NLP with limited high
quality training data and pre-trained models.
Length of Documents - Court cases can be extremely long in nature, the token
length constraint of transformer based models makes it challenging to process legal
documents.
37
37. Summary
We introduced a large corpus of legal documents for the under-
resourced language Hindi: Hindi Legal Documents Corpus (HLDC)
As a use-case for HLDC, we introduce the task of Bail Prediction - a
time-sensitive and impactful problem to tackle.
We experimented with several models and proposed a multi-task
learning based model that predicts salient sentences as an auxiliary task
and bail prediction as the main task
38
38. Future Work
Multiple Languages - India has 22 official languages, but for the majority of
languages, there are no legal corpora. The use of such corpora is not restricted only
to the legal domain. It can also act as a source of unsupervised data in low resource
languages to train transformer models.
Infusing Domain Specific Knowledge: Develop deep models infused with legal
knowledge so that model is able to capture legal nuances. For Indian legal
documents, we can incorporate information from the sections and acts in the IPC
(Indian Penal Code).
Other LegalNLP tasks: Apart from judgement prediction, a number of other tasks
like legal summarisation and prior case retrieval can help expedite judicial process.
39
39. Publications
Kapoor, A., Dhawan, M., Goel, A., Arjun, T.H, Agrawal, V., Agrawal, A.,
Bhattacharya, A., Kumaraguru, P., and Modi, A.
HLDC: Hindi Legal Documents Corpus. In Findings of the Association for
Computational Linguistics: ACL-IJCNLP 2022 , Association for Computational
Linguistics.
40
40. Acknowledgements
Prof. Ashutosh Modi and Prof. Arnab Bhattacharya - The work was a collaborative
effort between universities of IIIT Hyderabad, IIT Kanpur and IIIT Delhi. Thanks
for the opportunity to work on such an impactful problem.
IHUB-Data IIIT Hyderabad for funding my Masters.
The Precog group for the constant support and encouragement.
Family - For everything.
41