Unlocking Legal Litigation Analysis with ChatGPT
A Step-by-Step Tutorial
July 11th, 2023
The legal field is notorious for its vast amounts of complex and unstructured textual data. Legal professionals spend countless hours poring over legal documents, searching for relevant information, and extracting key insights to build persuasive arguments. However, this traditional approach to legal analysis is time-consuming, labor-intensive, and prone to human error. Law firms typically employ junior analysts to sift through relevant and irrelevant documents and annotate the relevant sections and information, a process that can take months and significantly delay the completion of the case.
In recent years, the rapid advancement of NLP techniques, combined with the advent of large language models such as OpenAI’s GPT-4, has paved the way for more efficient and accurate legal analysis. These models can not only extract named entities, such as people, organizations, locations, events, and laws, but also generate concise summaries and identify critical facts buried within extensive legal documents. By automating these time-consuming tasks, legal professionals can focus on higher-level strategic analysis and decision-making.
Moreover, by embedding the enriched legal documents into a high-dimensional vector space, additional analysis can be performed. Similarity analysis allows for the comparison of multiple cases based on their facts, revealing patterns and relationships that might otherwise be missed. Clustering techniques can group similar cases together, aiding in the identification of precedents or relevant legal arguments. Last but not least, one can apply predictive analytics to predict the outcome of the case based on its facts and entities. The possibilities are limitless.
In this tutorial, we are going to use the LLM GPT-3.5-Turbo to extract relevant information from environmental litigation cases: named entities, the facts presented, and a summary of each case.
During the litigation process, an enormous amount of effort is put into analyzing previous cases in search of evidence that will support the current case. Traditional keyword search with Ctrl + F is not enough to yield accurate and complete results, since it requires knowing the keyword to search for beforehand. As the saying goes, “you don’t know what you don’t know”. A better way is to use conceptual search based on entities. For example, you could look for all contaminants, chemicals, organizations and people mentioned in the case. This can be done by training a Named Entity Recognition (NER) model on a small training dataset. For more information, please read this article.
For this tutorial, we are going to use the dataset from the climate case chart website that has all the US climate litigation cases: http://climatecasechart.com/case-category/clean-water-act/
The Facts section in particular is important, since it contains the factual background of the case. It can be summarized, mined for relevant entities, and embedded as a vector to find similar cases based on their facts.
For this tutorial, we are going to analyze the entire document to:
- Extract entities such as Organization, Person, Law case, Location, Evidence, etc.
- Summarize the case
- Extract facts from the case
Let’s get started!
Building an NLP Workflow for Litigation Cases
To identify the entities mentioned in the previous section, we would typically need to create a labeled dataset and train an AI model to detect them automatically in the document. An alternative is to use the zero-shot learning (ZSL) capability of LLMs such as GPT to detect the entities we are interested in without any training. This process, typically called in-context learning, leverages the LLM’s knowledge to infer the labels from the document.
To do entity extraction, summarization and fact extraction in one pass, we are going to use the newly launched app by UBIAI called AI Builder. AI Builder is built to abstract away the complexity of building NLP workflows so that they can be assembled in just a few clicks. You can learn more about the new app here.
Using its workflow interface, we can start assembling our modules together to create the desired workflow:
AI Builder workflow creation interface
- The first step of the workflow is importing our PDF using the “Import Text” module.
- Second, we feed the text to the “Zero Shot NER” module, which extracts all the entities we need without any training. Under the hood, the module calls the OpenAI API with a prompt containing the entity types entered by the user, then parses the output to auto-tag the document with the labels.
Zero shot NER prompt interface
- Third, we feed the entire document, defined by the variable [[input_text]] below, to the LLM to summarize the text.
Summarizer prompt interface.
- Finally, we extract facts by prompting the LLM. Notice that we have explicitly asked the LLM not to “change the original words”, because LLMs are known for hallucinating false information. This step requires more prompt engineering to get right.
Fact identifier prompt interface
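To make the zero-shot NER step more concrete, here is a minimal Python sketch of the general idea: build a prompt from the user's labels, then parse the model's answer into tagged spans. It is not AI Builder's actual implementation. The prompt wording, the JSON answer format, and the names `LABELS`, `build_ner_prompt` and `parse_entities` are all assumptions made for illustration, and the call to the LLM itself is left out.

```python
import json

# Hypothetical label set, based on the entity types listed in this tutorial.
LABELS = ["Organization", "Person", "Law case", "Location", "Evidence"]


def build_ner_prompt(text, labels):
    """Build a zero-shot NER prompt that asks for JSON output (assumed format)."""
    label_list = ", ".join(labels)
    return (
        "Extract the named entities from the text below.\n"
        f"Allowed labels: {label_list}.\n"
        'Answer with a JSON list of objects: [{"text": "...", "label": "..."}].\n'
        "Copy each entity exactly as it appears in the text.\n\n"
        f"Text:\n{text}"
    )


def parse_entities(raw_response, text, labels):
    """Keep only well-formed entities with an allowed label whose text occurs
    verbatim in the source, and attach character offsets for auto-tagging."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # the model did not answer with valid JSON
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        span, label = item.get("text"), item.get("label")
        if not isinstance(span, str) or not span or label not in labels:
            continue
        start = text.find(span)  # first occurrence only, for simplicity
        if start == -1:
            continue  # the span is not in the document, so drop it
        entities.append(
            {"text": span, "label": label, "start": start, "end": start + len(span)}
        )
    return entities
```

In practice, the string returned by `build_ner_prompt` would be sent to the LLM and the reply passed to `parse_entities`. Dropping spans that do not occur in the source is a cheap guard against invented entities, but it does not catch wrong labels, such as a statute tagged as an organization.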
Below is the entire workflow created in AI Builder by dragging and dropping the modules and logically connecting them together.
Litigation case analysis workflow in AI Builder
We are now ready to run the workflow and get the structured data.
Litigation Case Analysis with NLP
To run the workflow, we simply drag and drop all the files we would like to process and press the run button. We can process thousands of documents at once in just a few minutes.
Below is the result for one litigation case:
Analyzed output in AI Builder
Impressively, the zero-shot NER model was able to identify many correct entities, such as the laws, organizations, people and law cases mentioned in the document. We can export the structured data in Excel format as well:
NER CSV output
However, looking closely at the results, we notice a few mistakes made by the LLM, such as identifying CWA as an organization.
Mistakes made by the zero shot NER model
This limitation is expected: the LLM was trained on a generic dataset drawn from across the internet rather than on our entities of interest, so it will typically underperform a supervised NER model trained for the task. Nevertheless, we can review the model’s extractions and correct them using the Review dashboard in AI Builder.
AI Builder review interface
Once the data has been reviewed and corrected, we can use it to train a specialized NER model right from the platform:
AI Builder training configuration window
This human-in-the-loop approach lets us take full advantage of zero-shot learning to bootstrap a training dataset, which we can then use to train a more accurate NER model.
The summary of the document produced by the LLM is quite accurate, explaining the main facts of the case in a short paragraph:
Summarizer: The document is a complaint filed in the United States District Court, challenging the U.S. Army Corps of Engineers' issuance of a Clean Water Act permit for the construction of the Cainhoy Plantation in South Carolina. The plaintiffs, including the South Carolina Coastal Conservation League and Charleston Waterkeeper, argue that the permit was unlawfully issued and violates the CWA's requirement to consider the least environmentally damaging practicable alternative. They also claim that the Corps failed to comply with the mandates of the National Environmental Policy Act and that the U.S. Fish and Wildlife Service unlawfully issued a Biological Opinion authorizing the destruction of endangered species' habitat. The plaintiffs seek to vacate the permit and enjoin the defendants from proceeding with the development until they comply with the CWA, NEPA, the Endangered Species Act, and the Administrative Procedure Act.
Below is a small snippet of the facts extracted by the LLM. Although the facts were correctly extracted from the correct section of the document, a significant number of them were missing. This could be due to the limited context length of the LLM, which abruptly stopped the text generation. A better approach here would be to train a sentence classifier instead.
96. However, the Corps own wetland delineation of Cainhoy found a total of 4,546.2 acres of uplands, over 1,000 acres short of the number cited in the EA and almost 500 acres short of the criterion the Corps evaluated and determined to be necessary for any alternative to be suitable. Id. at 100. 97. The Corps thus permitted the Cainhoy development despite this alternative's failure to meet the agency's own criteria for satisfying the project purpose. 98. The Corps also permitted the Cainhoy development when the proposal was not the least environmentally damaging practicable alternative. 99. The Corps found that the Dover [Kohl] Plan could be a feasible development design, yet went on to reject this alternative as not the intent of the applicants, nor what the applicants have proposed... Decision Document at 133 (emphasis in original). 100. The only criterion that the Dover Kohl plan fails to satisfy is the site size criterion. But the selected Cainhoy development also fails to satisfy this criterion, as discussed above in paragraphs 96–97. 101. Of the other criteria, the Dover Kohl plan satisfies each one: it is located within 50 miles of Charleston with a commuter time of 45 minutes or less, it is within a 10-mile radius of an interstate or four-lane major highway, it is attainable, it has access to utilities infrastructure, and it is appropriately zoned for mixed-use development. 102. Further, the Dover Kohl plan satisfies the project purpose because it would construct a mixed-use development to include residential, commercial, educational, office and government facilities with access to major traffic arteries that has the demographic support, zoning, infrastructure, and access to schools and hospitals to help meet the needs of the growing Charleston, South Carolina metropolitan area. 103. 
The Corps concluded that because the Dover Plan is inconsistent with the overall project purpose, it is not considered to be a practicable on-site alternative for the Cainhoy Plantation property in light of overall project purpose. Id. at 133–34. 104. The Corps did not cite any specific reason why the Dover Kohl plan was inconsistent with the overall project purpose.
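Since the fact-extraction prompt asks the LLM not to change the original words, one simple sanity check is to confirm that each extracted fact really appears word for word in the source document. The sketch below is one possible way to do this in Python and is not part of AI Builder. The function names `normalize` and `split_verbatim` are hypothetical, and the check assumes that differences in whitespace (for example, line breaks introduced by PDF extraction) should be ignored.

```python
import re


def normalize(s):
    """Collapse runs of whitespace so PDF line breaks do not affect matching."""
    return re.sub(r"\s+", " ", s).strip()


def split_verbatim(facts, source_text):
    """Split facts into those found word for word in the source and the rest."""
    source = normalize(source_text)
    found, suspect = [], []
    for fact in facts:
        key = normalize(fact)
        if key and key in source:
            found.append(fact)
        else:
            suspect.append(fact)  # reworded, invented, or empty: needs review
    return found, suspect
```

Facts that land in the `suspect` list can be sent to human review. Note that this check only flags altered wording. It cannot detect facts the model left out, which was the main problem observed above.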
In this article, we presented a step-by-step tutorial on how to extract relevant information from environmental litigation cases by chaining multiple LLM-powered modules for NER, summarization and fact extraction into a single workflow.
As a future step, we can embed the enriched legal documents into a high-dimensional vector space, which opens up opportunities for additional analysis. For example, similarity analysis can reveal patterns and relationships between multiple cases, aiding in the identification of precedents and relevant legal arguments. Clustering techniques can group similar cases together, while predictive analytics can be applied to forecast case outcomes based on their facts and entities.
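As a rough illustration of similarity analysis, the Python sketch below ranks cases by the cosine similarity of their embedding vectors. The case identifiers and three-dimensional vectors are made-up placeholders. Real embeddings would come from an embedding model and have hundreds or thousands of dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings):
    """Rank every other case by similarity to the query case, best first."""
    query = embeddings[query_id]
    scores = [
        (case_id, cosine_similarity(query, vec))
        for case_id, vec in embeddings.items()
        if case_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings of each case's facts (made up).
toy_embeddings = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.9, 0.1, 0.0],
    "case_c": [0.0, 1.0, 0.0],
}
```

Calling `most_similar("case_a", toy_embeddings)` ranks `case_b` ahead of `case_c`, because its vector points in nearly the same direction as that of `case_a`.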
If you are looking to create your own workflow, schedule a demo with us here.