Where do we fit in?

Sucheta Jawalkar
6 min read · Oct 19, 2017

The Insight Data Science blog recommends that the best way to put your data science tools to work is to

Determine a question you may be interested in answering by using data. What are you passionate about? What would make your life more efficient? What would improve the city you live in?

then use your Data Science tools to answer that question.

My partner and I often talk about what city we could move to or where we could live in the U.S. that would be a good fit for our values and priorities. We are a mixed couple with graduate degrees and a biracial child.

On the quest to “fit in” I asked the question, “where are the people that are like us?”

That is my daughter below with her “question” face, which is also similar to her “confused” face and her “I am not eating that” face.

Preprocessing

I generated a Census API key and queried data from the 2000 Census Summary File 3, then used Pandas and the scikit-learn machine learning package in a Jupyter notebook to look at 23 census variables for 3219 counties in the U.S.
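A minimal sketch of the query step, assuming the public Census Data API endpoint for the 2000 Summary File 3; the `H011002` variable code is from the post, and the API key is a placeholder you would replace with your own:

```python
import json
import urllib.request

import pandas as pd

API_KEY = "YOUR_CENSUS_API_KEY"  # placeholder, not a real key


def build_sf3_url(variables, api_key):
    """Build a query URL for the 2000 Census Summary File 3 API,
    requesting the given variables for every U.S. county."""
    return (
        "https://api.census.gov/data/2000/sf3"
        f"?get=NAME,{','.join(variables)}"
        "&for=county:*"
        f"&key={api_key}"
    )


def fetch_counties(variables, api_key):
    """Query the API and return the result as a DataFrame.

    The API returns a JSON array whose first row is the header."""
    url = build_sf3_url(variables, api_key)
    with urllib.request.urlopen(url, timeout=30) as resp:
        rows = json.load(resp)
    return pd.DataFrame(rows[1:], columns=rows[0])


# Example: owner-occupied housing units (H011002) for every county
# df = fetch_counties(["H011002"], API_KEY)
```

The result can then be saved to CSV and loaded back into any notebook session.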

I imported the csv file into a Jupyter notebook. The data frame above includes population based on race, married couples with children under 18, home ownership, education level, and aggregate family income. For example, the variable name H011002 in the above data frame corresponds to owner-occupied housing units.

I then created a smaller data frame of 17 columns to look at populations of non-white citizens with family income over $100K, black males over age 25 with a graduate degree, and Asian females over age 25 with a graduate degree. To make sense of these data I wanted to be able to visualize or rank them in some way.
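The subsetting step might look like the sketch below. The actual 17 SF3 variable codes are not listed in the post, so the column list is a hypothetical stand-in (only `H011002` appears in the text):

```python
import pandas as pd

# Hypothetical column list: H011002 is from the post; the remaining
# 16 SF3 variable codes would go here.
COLUMNS_OF_INTEREST = [
    "H011002",  # owner-occupied housing units
]


def subset_counties(df, columns):
    """Keep only the selected census variables, coerced to numeric
    (the API returns strings), with missing values filled with 0."""
    sub = df[columns].apply(pd.to_numeric, errors="coerce")
    return sub.fillna(0)
```

Coercing to numeric up front avoids surprises later, since PCA needs a purely numerical matrix.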

PCA

I used the Principal Component Analysis (PCA) method to reduce the dimensions of the data from 17 to 2. The first step was to find the mean and variance of the data and normalize each data column.

The next step was to compute the covariance matrix and get the U matrix using singular value decomposition. After implementing this, I found that scikit-learn had a PCA module that could be used directly, so I used that and the tools that came with it to look at the variance captured by the first two principal components.
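The scikit-learn route described above can be sketched as follows; the function name is mine, and the random array in the comment is only a stand-in for the real 3219 × 17 census matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_to_2d(X):
    """Standardize each column, then project onto the first two
    principal components. Returns the 2-D scores and the fraction
    of variance explained by each component."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X_std)
    return scores, pca.explained_variance_ratio_


# Example with stand-in data shaped like the census matrix:
# scores, ratios = reduce_to_2d(np.random.rand(3219, 17))
```

`explained_variance_ratio_` is what gives the roughly 80% / 8.4% split reported below.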

The first principal component captures over 80% of the variance and the second around 8.4%. It would be nice if the numbers were higher, but their sum was close enough to 90% that I felt comfortable using this as a good-enough way to make sense of the data.

I used Sebastian Raschka’s “PCA in 3 Simple Steps” method to get the eigenvectors for the new feature space and reconstruct the 3219 rows of data in the new feature space.
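A minimal by-hand version of those three steps, as I understand them (standardize, eigendecompose the covariance matrix, project); function and variable names are my own:

```python
import numpy as np


def pca_by_hand(X, n_components=2):
    """PCA in three steps: standardize, eigendecompose the covariance
    matrix, and project onto the top eigenvectors."""
    # 1. Standardize each column to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix and its eigendecomposition
    #    (eigh is appropriate because the matrix is symmetric).
    cov = np.cov(X_std.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort eigenvectors by descending eigenvalue and project:
    #    W is the d x k projection matrix, Y = X_std W the new features.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:n_components]]
    return X_std @ W
```

Applied to the census matrix, this yields the 3219 × 2 `Y` matrix ranked in the next section.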

Ranking PCA 1

I took the reconstructed data in the Y matrix above and appended a counter starting at 0 to match the new feature array to the raw data frame. I then sorted the new feature array by the first principal component. I struggled with whether this was the correct thing to do, even though it seemed like the natural thing to do. I would love some feedback here! Looking at the results was somewhat comforting.

I selected the first 10 and last 10 rows of the sorted array and matched the counter in column 3 to the counter in the raw data to get the top 10 and bottom 10 counties.
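The ranking step can be sketched like this; instead of carrying an explicit counter column, the sketch uses `argsort` indices to map back to county names, which plays the same role (names and function are hypothetical):

```python
import numpy as np


def top_bottom_counties(Y, county_names, k=10):
    """Rank rows by the first principal component and return the
    top-k and bottom-k county names."""
    order = np.argsort(Y[:, 0])  # row indices, ascending by PC1
    bottom = [county_names[i] for i in order[:k]]
    top = [county_names[i] for i in order[-k:][::-1]]  # highest first
    return top, bottom
```

The index array returned by `argsort` is exactly the link back to the raw data frame that the counter column provides.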

Results

Here are my results!

The results of the first-component ranking indicate several California counties, Queens, NY, and the Phoenix and Chicago areas. All seem like counties with diverse populations. I intentionally did not normalize the data by population because we favor large cities with major airports.

Double Check

As this was my first independent data science project, I needed some type of check to catch gross errors. I repeated the analysis above selecting white populations, white homeowners, families with children under 18, and aggregate family income over $100K, giving a data frame of 10 columns for the 3219 counties.

I was expecting at least a few common counties. Here are the results for this new, smaller data frame:

I had multiple counties match!

PCA Second Component

I wanted to investigate further and come up with an interpretation for the second principal component from my original census data sample. I used the housing_edit data frame for the results below. As a first attempt, I looked at the top and bottom ten counties using the second principal component ranking. The results for the bottom 10 included several Bay Area counties, some even from the top 10 of the first component ranking!

Here is a list of the top 10. Not a single California county!

All are diverse city areas with a large population of educated black males. PCA 1 is a general measure of Asian or mixed, high-population areas, whereas PCA 2 favors counties with a large population of educated black males over 25. Tech recruiters: here are the counties you can recruit from!

Next Steps

The next steps for this project would be to run the code on the 1960, 1970, 1980, 1990, and 2010 census data files and plot the rankings of the TOP and BOTTOM 10 counties as a function of time.

I could then fit the time dependencies with a realistic model and predict what the ranking would be for the next few decades.

Conclusion

I was stunned that the data tend to confirm the Bay Area Tech diversity problem. I also realized that I still wasn’t sure I wanted to live in any of those counties listed in my TOP 10. I learned that maybe I do not care all that much about “fitting in”.

In retrospect, maybe I should have asked “How far do I want to travel for Thanksgiving?”, “Where could my partner and I both have happy careers?”, or “Where would we find a place where my family is treated with kindness and civility?”

I hope you had fun reading this! I would love to hear any feedback you have.

Sucheta Jawalkar

Data Science @CVSHealth Physicist/DataScientist/ Wife/Mom/ChurnedAcademic. Find out more about me here! https://www.linkedin.com/in/suchetajawalkar/