Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests

Journal Article



Abstract

  • As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyze a large survey that captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there is a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm affect migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-method approaches may help to shed light on factors contributing to migration and non-migration.

publication date

  • June 1, 2022

has restriction

  • hybrid

Date in CU Experts

  • January 29, 2023 6:38 AM

Full Author List

  • Best K; Gilligan J; Baroud H; Carrico A; Donato K; Mallick B

author count

  • 6

International Standard Serial Number (ISSN)

  • 1436-3798

Electronic International Standard Serial Number (EISSN)

  • 1436-378X

Additional Document Info


volume

  • 22

issue

  • 2

number

  • 52