BOOK-CHAPTER

Mining Requirements and Design Documents in Software Repositories Using Natural Language Processing and Machine Learning Approaches

Abstract

Context: Mining an unstructured software repository entails analyzing and thoroughly understanding its data in order to produce quality software products that satisfy users. However, the vast amount of information entering the repository, together with the data's unstructured and noisy nature, makes it difficult to obtain timely and error-free information. This hinders the timely completion of development projects and leads to an inevitable delay in delivery. Objective: The chapter aims to develop a recommendation system that helps software developers produce new products of good quality. The overarching goal is to understand and resolve the challenges, complexities and peculiarities of data in software repositories, so that software developers can locate useful data in a software development project without wasting time. Methods: We adopted a quantitative research approach, experimenting with a tool we developed called System Analysis and Mining List Recommendation System (SAMLRS). We used PyDriller to collect data from GitHub; precisely, one thousand (1,000) projects were extracted. The dataset was processed and analyzed using Google BigQuery and Natural Language Processing (NLP). We formulated a model that handles the unstructured data in the repositories using an Artificial Neural Network (ANN) technique. The model was specified in Unified Modeling Language (UML), and the system was implemented in the Python programming language. We used Recall, Precision and Execution time as parameters to evaluate the performance of the model. Results: The system produced a timely recommended list of requirements documents upon programmers' requests. Out of the 1,000 repositories extracted from GitHub in the raw dataset, more than 700 were well structured. We obtained a performance improvement of 75% in terms of structuring data in the repository. We also obtained an 84% performance improvement in terms of data recommendation, with an execution time of 1.98 seconds. Our results imply that programmers can locate functional requirements and design documents more effectively and efficiently.
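The abstract names Recall, Precision and Execution time as evaluation parameters but does not say how they were computed. The following is a minimal sketch using the standard set-based definitions, precision = |recommended ∩ relevant| / |recommended| and recall = |recommended ∩ relevant| / |relevant|, with wall-clock timing around the call. The function names and document identifiers are hypothetical and are not taken from SAMLRS.

```python
import time


def precision_recall(recommended, relevant):
    """Return (precision, recall) for one recommendation list.

    Standard set-based definitions; an assumption, since the chapter's
    abstract does not give its exact formulas.
    """
    recommended_set = set(recommended)
    relevant_set = set(relevant)
    hits = len(recommended_set & relevant_set)
    precision = hits / len(recommended_set) if recommended_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical document identifiers, for illustration only.
recommended = ["req_login.md", "design_auth.md", "req_export.md", "notes.txt"]
relevant = ["req_login.md", "design_auth.md", "req_search.md"]

(precision, recall), elapsed = timed(precision_recall, recommended, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} time={elapsed:.6f}s")
```

In this toy case two of the four recommended documents are relevant, and two of the three relevant documents are recovered. In a real evaluation, the timed call would wrap the recommendation step itself rather than the metric computation.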

Keywords:
Computer science; Natural language processing; Software; Artificial intelligence; Software engineering; Programming language

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.50

Topics

Software Engineering Research
Physical Sciences →  Computer Science →  Information Systems
Software Engineering Techniques and Practices
Physical Sciences →  Computer Science →  Information Systems
Big Data and Business Intelligence
Social Sciences →  Business, Management and Accounting →  Management Information Systems