ISS has a global network of expertise in providing essential social services to children across borders — mainly in the domains of child protection and migration. However, with over 70,000 open cases per year, ISS caseworkers faced two core challenges: managing time and managing data. These challenges were often exacerbated by administrative backlogs and the high turnover rate common in the nonprofit sector.
If we could significantly reduce the time lost to repetitive administrative work, caseworkers could focus on direct, high-impact tasks, helping more children and families access better-quality services. ISS saw an urgent need for a technological transformation in the way they managed cases — and this is where Omdena came into the picture.
The overarching goal of this project was to improve the quality of case management and avoid unnecessary delays in service. Our main question remained: how could we save caseworkers' time and leverage their data in a meaningful way?
It was time to break the problem down into more specific targets. We identified the factors that hinder ISS caseworkers from focusing on client-facing activities, shown in the following picture.
We saw that these subproblems could each be solved with different tools, which would help organizations like ISS understand the various ways that machine learning can be integrated with their existing system.
We also concluded that if we could manage data in a more streamlined manner, we could manage time more efficiently. Therefore, we decided that introducing a sample database system would prove to be beneficial.
The biggest challenge we faced as a team was the data shortage. ISS had a strict confidentiality agreement with their clients, which meant they couldn’t simply give us raw case files and expose private information.
Initially, ISS gave us five completed case files with names either masked or altered. As manually editing cases would have taken up caseworkers’ client-facing time, our team at Omdena had to find another way to augment data.
Our team collectively tackled the main problem from various angles, as follows:
As we had only five data points to work with and were not authorized to access ISS’s data pool, we clarified that our final product would be a proof-of-concept rather than a production-ready system.
Additionally, keeping in mind that our product was going to be used by caseworkers who may not have a technical background, we consolidated the final deliverables into a web application with a simple user interface.
Due to the limited number of cases available at the start of the project, the first task at hand was to collect more child-related cases from various sources. We concentrated mainly on child abuse and migration cases. We gathered the case files and success stories that were publicly available on ISS partner websites, Malawi's Child Protection Training Manual, Bolt Burdon Kemp, and Act For Kids. We also collected a catalog of child welfare court cases from the European Court of Human Rights (ECHR), WorldCourt, US Case Law, and LawCite. In the end, we had collated a dataset of about 230 cases and were ready to use them in our project pipeline.
We relied on a supervised learning approach for our risk score prediction model, manually labeling a risk score for each of our cases. A risk score, a float value ranging from 0 to 1, is intended to highlight priority cases (i.e., cases that involve a threat to a child's wellbeing or have tight time constraints) that require the immediate attention of caseworkers.
The scores took into consideration factors such as a history of abuse within the child's network, access to education, the caretaker's willingness to care for the child, and so on. To reduce bias, three collaborators independently scored each case, and the average of the three was taken as the final risk score for that case.
Finally, we demarcated the risk scores into three categories, using the following threshold.
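The scoring and bucketing logic can be sketched in a few lines. The 0.4 / 0.7 cut-offs and the sample annotator scores below are illustrative assumptions, not the project's actual thresholds or labels:

```python
def final_risk_score(annotator_scores):
    """Average the risk scores given by the three collaborators."""
    return sum(annotator_scores) / len(annotator_scores)

def risk_category(score, low=0.4, high=0.7):
    """Map a 0-1 risk score onto three categories. The 0.4 / 0.7 cut-offs
    are illustrative, not the project's actual thresholds."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"

labels = [0.8, 0.7, 0.9]            # hypothetical scores from three annotators
score = final_risk_score(labels)
print(round(score, 2), risk_category(score))  # → 0.8 high
```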
Additionally, we augmented our data — which originally only contained case text — by adding extra information such as case open and close dates, type of service requested, and country where the service is requested. Using this, we created a seed file to populate our sample database. These parameters would later help caseworkers see how applying a simple database search and filter system can enable dynamic data retrieval.
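To show how such metadata enables a simple search-and-filter system, here is a minimal sketch over hypothetical seed records; the field names and values are invented for illustration, not the actual ISS schema:

```python
# Hypothetical seed records carrying the augmented metadata fields.
cases = [
    {"id": 1, "country": "Malawi",  "service": "child protection", "opened": "2019-03-01"},
    {"id": 2, "country": "Germany", "service": "migration",        "opened": "2020-01-15"},
    {"id": 3, "country": "Malawi",  "service": "migration",        "opened": "2020-06-20"},
]

def filter_cases(cases, **criteria):
    """Return the cases whose fields match every given criterion."""
    return [c for c in cases if all(c.get(k) == v for k, v in criteria.items())]

print([c["id"] for c in filter_cases(cases, country="Malawi")])  # → [1, 3]
```

In the project this kind of lookup was handled by database queries rather than in-memory filtering, but the retrieval idea is the same.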
Next, we moved to data preprocessing which is crucial in any data project pipeline. To generate clean, formatted data, we implemented the following steps:
- Text Cleaning: Since the case texts were pulled from different sources, we had different kinds of noise to remove, including special characters, unnecessary numbers, and section titles.
- Lowercasing: We converted the text to lower case to avoid multiple copies of the same words.
- Tokenization: Case text was further converted into tokens of sentences and words to access them individually.
- Stop word Removal: As stop words did not contribute to certain solutions that we worked on, we considered it wise to remove them.
- Lemmatization: For certain tasks like keyword and risk factor extraction, it was necessary to reduce words to their lemmatized forms (e.g., “crying” to “cry,” “abused” to “abuse”), so that words sharing the same root are not counted multiple times.
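The steps above can be sketched as a minimal pipeline. The stop word list is a toy subset, tokenization is naive whitespace splitting, and lemmatization is omitted (in the project it relied on an NLP library):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "was", "to", "of", "and", "in"}  # toy subset

def preprocess(text):
    """Apply the cleaning steps above to one case text."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # text cleaning: drop noise characters
    text = text.lower()                        # lowercasing
    tokens = text.split()                      # naive word tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

print(preprocess("Case #12: The child was exposed to neglect."))
# → ['case', 'child', 'exposed', 'neglect']
```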
We had to convert the case texts into compact numerical representations of fixed lengths to make them machine-readable. We considered four embedding methods — Term Frequency-Inverse Document Frequency (TF-IDF), Doc2Vec, Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT).
To choose the one that works best for our case, we embedded all cases using each embedding method. Next, we reduced the embedding vector size to 100 dimensions using Principal Component Analysis (PCA). Then, we used a hierarchical clustering method to group similar cases as clusters. To find the optimal number of clusters for our problem, we referred to the dendrogram plot. We finally evaluated the quality of the clusters using Silhouette scores.
After performing these steps for all four algorithms, we observed the highest Silhouette score for USE embeddings, which was then selected as our embedding model.
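The cluster-quality metric we used can be computed directly from its definition. A stdlib sketch on toy 2-D points (standing in for the PCA-reduced embeddings) illustrates it; in practice a library implementation such as scikit-learn's would be used:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette coefficient: for each point, a is the mean distance to
    its own cluster, b the lowest mean distance to any other cluster, and the
    point's score is (b - a) / max(a, b)."""
    clusters = {l: [p for p, m in zip(points, labels) if m == l]
                for l in set(labels)}
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:                      # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in qs) / len(qs)
                for m, qs in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated toy clusters score close to the maximum of 1.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(round(silhouette(pts, [0, 0, 1, 1]), 3))  # → 0.929
```

A higher mean silhouette means tighter, better-separated clusters, which is why the USE embeddings won this comparison.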
We tried multiple pre-trained summarizers, including BART, XLNet, BERT-SUM, and GPT-2, all made available through the Hugging Face Transformers library. Since evaluation metrics such as ROUGE-N and BLEU require far more reference summaries than we had, we opted for a relative performance comparison, checking the quality and noise level of each model's output. Inference speed then played a major role in determining the final model for our use case: XLNet.
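The transformer models themselves are too heavy to reproduce here, but the idea of extractive summarization (selecting the most informative sentences verbatim from the source) can be illustrated with a classic frequency-based baseline, which is not one of the models above:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the corpus frequency of its words and return the
    top-scoring sentences verbatim, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(i):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower()))
    top = sorted(range(len(sentences)), key=score, reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = ("The child needs urgent care. The child lives with an aunt. "
        "The office closed early.")
print(extractive_summary(text, 1))  # → The child lives with an aunt.
```

The pre-trained models score and select sentences far more intelligently, but the output has the same extractive property: every sentence in the summary appears in the source.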
Keyword & Entity Relation Extraction
Keywords were obtained from each case file using RAKE, a keyword extraction algorithm that determines high-importance phrases based on their frequencies in relation to other words in the text.
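A stripped-down version of RAKE's scoring conveys the idea: candidate phrases are runs of non-stop words, each word scores degree over frequency, and a phrase scores the sum of its word scores. The stop word list here is a toy subset, and phrase splitting at punctuation is omitted:

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "was", "of", "to", "and",
              "in", "by", "with", "after"}   # toy subset of a real stop list

def rake_keywords(text, top_n=3):
    """Minimal RAKE: phrases are runs of non-stop words; each word scores
    degree / frequency, and a phrase scores the sum of its word scores."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOP_WORDS:               # stop words delimit candidate phrases
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for ph in phrases:
        for w in ph:
            freq[w] += 1
            degree[w] += len(ph)          # co-occurrence degree within a phrase
    ranked = sorted(phrases, key=lambda ph: sum(degree[w] / freq[w] for w in ph),
                    reverse=True)
    return [" ".join(ph) for ph in ranked[:top_n]]

case = "The child was removed by the child protection services after repeated neglect"
print(rake_keywords(case, 1))  # → ['child protection services']
```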
For entity relations, we tried several techniques using OpenIE and AllenNLP, but each had its own drawbacks, such as producing repetitive information. We therefore implemented a custom relation extractor using spaCy, which better identified subject and object nodes, as well as their relationships, based on root dependencies.
The pairwise similarity was computed between a given case and the rest of the data based on USE embeddings. Among Euclidean distance, Manhattan distance, and cosine similarity, we chose cosine similarity as our distance metric for two reasons.
First, it works well with unnormalized data. Second, it takes into account the orientation (i.e. angle between the embedding vectors) rather than the magnitude of the distance between the vectors. This was favorable for our task as we had cases of various lengths, and needed to avoid missing out on cases with diluted yet similar embeddings.
After computing similarity scores against all cases in our database, we fetched the five cases with the highest similarity to the input case.
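The retrieval step reduces to a cosine similarity ranking. A minimal sketch, with toy 3-dimensional vectors standing in for the 512-dimensional USE embeddings:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Angle-based similarity; insensitive to vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar(query_vec, case_vecs, top_k=5):
    """Indices and scores of the top_k cases closest to the query embedding."""
    sims = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(case_vecs)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy "embeddings"; note the larger magnitude of case 0 does not matter.
query = (1.0, 0.0, 0.0)
cases = [(2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
print([i for i, _ in most_similar(query, cases, top_k=2)])  # → [0, 2]
```

Case 0 points in the same direction as the query despite being twice as long, which is exactly the magnitude-insensitivity that made cosine similarity suitable for cases of varying lengths.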
Risk Score Prediction
A number of regression models were trained using document embeddings as input and the manually labeled risk scores as targets, with AutoKeras, Keras, and XGBoost among the libraries used. The best-performing model — our custom Keras sequential neural network — was selected based on root mean square error (RMSE).
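The actual model was a Keras network, which is too heavy to reproduce here; a single linear layer trained by plain gradient descent on synthetic data sketches the same setup: embedding in, risk score out, RMSE as the yardstick. All data below is invented:

```python
import random
from math import sqrt

random.seed(0)
# Toy 4-dim stand-ins for the document embeddings, with synthetic "risk
# scores" that are a known linear function of the features.
X = [[random.random() for _ in range(4)] for _ in range(40)]
y = [0.5 * x[0] + 0.4 * x[1] + 0.05 for x in X]

w, b, lr = [0.0] * 4, 0.0, 0.1
for _ in range(2000):                       # plain per-sample gradient descent on MSE
    for xi, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

rmse = sqrt(sum((sum(wj * xj for wj, xj in zip(w, xi)) + b - yi) ** 2
                for xi, yi in zip(X, y)) / len(X))
print(f"RMSE: {rmse:.4f}")                  # converges toward zero
```

A neural network replaces the single linear layer with stacked nonlinear ones, but the training loop, loss, and evaluation metric follow this same pattern.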
Abuse Type & Risk Factor Extraction
We also built more domain-specific tools for another source of insight: algorithms that identify the primary abuse types and risk factors in a case.
For the abuse type extraction, we defined eight abuse-related verb categories such as “beat,” “molest,” and “neglect.” spaCy’s pre-trained English model en_core_web_lg and part-of-speech (POS) tagging were used to extract verbs and transform them into word vectors. Using cosine similarity, we compared each verb against the eight categories to find which abuse types most accurately capture the topic of the case.
Risk factor extraction worked in a similar way: the text was likewise tokenized and preprocessed using spaCy. This algorithm, however, extended the abuse verbs above with additional risk words such as “trauma,” “sick,” “war,” and “lack.” Instead of looking only at verbs, we compared each word in the case text (excluding custom stop words) against the risk words. Words with a similarity score above 0.65 to any risk word were presented in their original form. This addition aimed to provide more transparency about which words may have affected the risk score.
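The category-matching logic behind both extractors can be illustrated with toy 3-dimensional word vectors standing in for spaCy's 300-dimensional en_core_web_lg vectors. The vector values, the extracted verbs, and the reuse of a 0.65-style threshold are all invented for illustration:

```python
from math import sqrt

# Toy vectors standing in for spaCy's en_core_web_lg word vectors;
# the values are made up purely to illustrate the comparison logic.
VECTORS = {
    "beat":    (0.9, 0.1, 0.0),
    "molest":  (0.1, 0.9, 0.0),
    "neglect": (0.0, 0.1, 0.9),
    "hit":     (0.8, 0.2, 0.1),   # a verb from a case text, close to "beat"
    "ignore":  (0.1, 0.0, 0.8),   # a verb from a case text, close to "neglect"
}
ABUSE_CATEGORIES = ["beat", "molest", "neglect"]  # three of the eight categories

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def abuse_type(verb, threshold=0.65):
    """Assign an extracted verb to its closest abuse category, if similar enough."""
    best = max(ABUSE_CATEGORIES, key=lambda c: cosine(VECTORS[verb], VECTORS[c]))
    return best if cosine(VECTORS[verb], VECTORS[best]) >= threshold else None

print(abuse_type("hit"), abuse_type("ignore"))  # → beat neglect
```

With real pre-trained vectors, semantically related words land close together in the same way, so case verbs the category list never mentions still map onto the right abuse type.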
To bring these models together in a way that ISS caseworkers could easily understand and use, we developed a simple user interface with Flask, a lightweight Python web framework. We created forms with WTForms and graphs with Plotly, and let Bootstrap handle the overall styling of the UI.
For the database, we used PostgreSQL, a relational database management system (RDBMS), along with SQLAlchemy, an object-relational mapper (ORM) that allows us to communicate with our database in a programmatic way. Our dataset, excluding the five confidential case files initially provided by ISS, was seeded into the database, which was then hosted on Amazon RDS.
We also added a public Tableau dashboard for visualizing the case files, should caseworkers wish to consult external resources and gain further insight into case outcomes.