This data science capstone project developed an automated system that classifies World Bank projects and summarizes their activities, focusing on broadband connectivity. It addressed a practical problem: the number and complexity of the World Bank’s projects make its overall impact difficult to summarize.
The solution involved a three-step process:
Classification: Projects were classified as related or unrelated to broadband connectivity by analyzing their indicator names. Two approaches were used: a Naïve approach and a Large Language Model (LLM) approach using BERT and Llama models.
Filtering: Classified indicators were filtered so that only the most up-to-date observations were kept and all ‘Progress Values’ were numerical.
Aggregation: Progress for each indicator was computed and converted to counts where necessary.
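The three steps above can be illustrated with a minimal sketch. It is not the project's code (which is under NDA): the keyword list, the record fields (`project_id`, `indicator`, `date`, `progress_value`), and the sample records are all hypothetical, and only the Naïve classification is shown. It also assumes ISO-formatted dates, falls back to the latest numeric observation when a newer one is non-numeric, and leaves out the conversion to counts, since the summary does not specify how that was done.

```python
from collections import defaultdict

# Hypothetical keyword list; the project's actual terms are not given here.
BROADBAND_KEYWORDS = ("broadband", "internet", "fiber", "connectivity")


def is_broadband(indicator_name):
    """Naive classification: flag an indicator whose name contains a keyword."""
    name = indicator_name.lower()
    return any(keyword in name for keyword in BROADBAND_KEYWORDS)


def to_number(value):
    """Return the value as a float, or None if it cannot be parsed as a number."""
    try:
        return float(str(value).replace(",", ""))
    except ValueError:
        return None


def latest_numeric(records):
    """Filtering: keep the most recent numeric observation per (project, indicator).

    Assumes dates are ISO strings (YYYY-MM-DD), which sort chronologically.
    """
    latest = {}
    for rec in records:
        value = to_number(rec["progress_value"])
        if value is None:
            continue  # drop non-numeric progress values
        key = (rec["project_id"], rec["indicator"])
        if key not in latest or rec["date"] > latest[key]["date"]:
            latest[key] = {**rec, "progress_value": value}
    return list(latest.values())


def aggregate(records):
    """Aggregation: sum progress per indicator across projects."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["indicator"]] += rec["progress_value"]
    return dict(totals)


# Made-up sample records, for illustration only.
records = [
    {"project_id": "P1", "indicator": "Households with broadband access",
     "date": "2021-06-30", "progress_value": "1,200"},
    {"project_id": "P1", "indicator": "Households with broadband access",
     "date": "2022-06-30", "progress_value": "1,500"},
    {"project_id": "P2", "indicator": "Households with broadband access",
     "date": "2022-01-15", "progress_value": "N/A"},
    {"project_id": "P2", "indicator": "Households with broadband access",
     "date": "2021-01-15", "progress_value": "300"},
    {"project_id": "P3", "indicator": "Roads rehabilitated (km)",
     "date": "2022-03-01", "progress_value": "40"},
]

broadband = [r for r in records if is_broadband(r["indicator"])]
totals = aggregate(latest_numeric(broadband))
print(totals)
```

In this sketch the LLM approach would replace only `is_broadband`; the filtering and aggregation functions would stay the same as long as the classifier returns a yes/no label per indicator name.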
The project compared the Naïve, BERT, and Llama models and identified the strengths and weaknesses of each for classification. Future steps include a hybrid classification approach combining BERT and TF-IDF, refined filtering and aggregation methods for LLM outputs, and the use of pre-labeled data from the World Bank for model training and evaluation. The BART-large-mnli model also showed promise towards the end of the project and could further improve classification.
Due to an NDA, I cannot share the GitHub repository or data.