Journal Article

A “Roziah” by Any Other Name: A Simple Bayesian Method for Determining Ethnicity From Names

Kridaraan Komahan and Daniel D. Reidpath

in American Journal of Epidemiology

Published on behalf of Johns Hopkins Bloomberg School of Public Health

Volume 180, issue 3, pages 325-329
Published in print August 2014 | ISSN: 0002-9262
Published online June 2014 | e-ISSN: 1476-6256 | DOI: http://dx.doi.org/10.1093/aje/kwu129

Preview

Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification of ethnicity from names typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011 to 2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were split into contiguous 3-letter substrings, and these were used as the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. There was little difference between the classification performance in the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
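
The general idea of a naïve Bayesian classifier over 3-letter substrings can be illustrated with a minimal Python sketch. This is not the authors' implementation, because the abstract does not give those details. The function names (`trigrams`, `train`, `classify`, `cohen_kappa`), the removal of spaces before splitting, the add-one smoothing (`alpha`), and the handful of toy names are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Return the contiguous 3-letter substrings of a name.

    Assumption: non-letters (including spaces) are dropped and the name is
    lowercased first; the abstract does not say how this was handled.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


def train(records):
    """Count names per ethnicity and trigram occurrences per ethnicity."""
    class_counts = Counter()
    gram_counts = defaultdict(Counter)
    vocab = set()
    for name, label in records:
        class_counts[label] += 1
        for g in trigrams(name):
            gram_counts[label][g] += 1
            vocab.add(g)
    return class_counts, gram_counts, vocab


def classify(name, model, alpha=1.0):
    """Pick the ethnicity with the highest log posterior (naive Bayes).

    alpha is an additive smoothing constant (an assumption of this sketch)
    so that unseen trigrams do not produce a zero probability.
    """
    class_counts, gram_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(gram_counts[label].values()) + alpha * len(vocab)
        for g in trigrams(name):
            score += math.log((gram_counts[label][g] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cohen_kappa(truth, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy, made-up training records (illustrative only, not the study data).
toy_records = [
    ("Roziah binti Ahmad", "Malay"),
    ("Siti Nurhaliza", "Malay"),
    ("Tan Wei Ling", "Chinese"),
    ("Lim Chee Keong", "Chinese"),
    ("Rajesh Kumar", "Indian"),
    ("Priya Subramaniam", "Indian"),
]
model = train(toy_records)
print(classify("Roziah", model))
print(classify("Wei Ling", model))
```

With only six toy names the output says nothing about real-world accuracy; the point is the mechanics. Each name contributes trigram counts to its ethnicity, and a new name is assigned to the group under which its trigrams are jointly most probable. Agreement between predicted and recorded ethnicity on a held-out set could then be summarized with `cohen_kappa`.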

Keywords: classification; ethnicity; naïve Bayesian approach

Journal Article.  3156 words. 

Subjects: Public Health and Epidemiology
