Conventional security methods for malware detection are resource-intensive. Basic information about malicious applications can, however, be obtained from the feedback users leave in app reviews. This paper presents a machine learning framework that uses app metadata and review sentiment to aid in the identification of malware. A dataset of 10,841 apps with 64,295 user reviews from the Google Play Store was analyzed using TF-IDF vectorization and feature engineering based on sentiment aggregation. The model is an ensemble of Random Forest, SVM, and XGBoost. SMOTE was used to handle class imbalance, and the decision threshold was optimized. After threshold optimization, the model reached a classification accuracy of 91% and an AUC of 0.98, outperforming systems that rely on static features only. These results demonstrate that user feedback analysis is a powerful mechanism for building scalable malware detection systems that proactively protect users from threats. Future work lies in developing deep learning NLP models and real-time risk assessment platforms that improve detection across application contexts.
This research follows a structured methodology that allows reproducible detection of harmful software by combining user reviews with metadata analysis. The methodology covers the selection of tools and instruments, the preprocessing methods used for data enhancement, the justification of the model selection, the training procedures, and the strategies for bias reduction and performance optimization.
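Because reviews are written per review while metadata is recorded per app, review sentiment has to be collapsed to one feature row per app before the two sources can be combined. The following is a minimal sketch of such an aggregation, not the exact procedure used in this study. The input layout (pairs of app name and polarity score in [-1, 1]), the function name `aggregate_sentiment`, the three output features, and the sample app names are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(reviews):
    """Collapse per-review polarity scores into one feature row per app.

    `reviews` is an iterable of (app_name, polarity) pairs, where polarity
    is assumed to lie in [-1, 1]. Reviews whose polarity is None are skipped.
    The feature names below are hypothetical, chosen for illustration.
    """
    by_app = defaultdict(list)
    for app, polarity in reviews:
        if polarity is not None:
            by_app[app].append(polarity)

    features = {}
    for app, scores in by_app.items():
        features[app] = {
            "review_count": len(scores),
            "mean_polarity": mean(scores),
            # Share of reviews with strictly negative polarity.
            "negative_share": sum(s < 0 for s in scores) / len(scores),
        }
    return features


# Toy input with made-up app names and scores.
reviews = [
    ("FlashlightPro", -0.8),
    ("FlashlightPro", -0.4),
    ("FlashlightPro", 0.2),
    ("NotesApp", 0.6),
    ("NotesApp", None),
]
features = aggregate_sentiment(reviews)
```

Each resulting row can then be joined with the app's metadata and its TF-IDF vector to form the model input.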
The ensemble model, which combines Random Forest, Support Vector Machine (SVM), and XGBoost, reaches an area under the ROC curve (AUC) of 0.98 for detecting malicious applications. Threshold optimization raised classification accuracy from 88% to 91% and improved the balance between precision and recall.
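To illustrate how threshold optimization can change the outcome of an ensemble, the sketch below averages the malicious-class probabilities of three models and then searches for the cutoff that maximizes F1. It rests on assumptions: soft voting (probability averaging) and F1 as the selection criterion are illustrative choices, since the combination rule and the optimization criterion are not specified here. The function names and the toy probabilities are hypothetical.

```python
def soft_vote(prob_lists):
    """Average the positive-class probabilities of several models per sample."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


def f1_at(threshold, scores, labels):
    """F1 score when every sample with score >= threshold is flagged malicious."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96)]  # 0.05 to 0.95
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Toy validation set: 1 = malicious, 0 = benign (made-up values).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
rf_probs = [0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.8, 0.9]
svm_probs = [0.2, 0.2, 0.4, 0.3, 0.6, 0.6, 0.9, 0.8]
xgb_probs = [0.0, 0.1, 0.3, 0.4, 0.5, 0.5, 0.7, 1.0]

scores = soft_vote([rf_probs, svm_probs, xgb_probs])
t_star = best_threshold(scores, labels)
f1_default = f1_at(0.5, scores, labels)
f1_tuned = f1_at(t_star, scores, labels)
```

On this toy data the tuned cutoff falls below the default of 0.5, recovering a malicious sample that the default threshold misses. In practice the threshold should be chosen on a validation split and reported on a separate test split, so that the tuned figure is not optimistic.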