Worker Flows

May 23, 2017 · 7 minute read

For a research project I wanted to answer the question “What percentage of people move each year?” where “move” is defined as change industry or change location. I spent some time looking for data to help me answer the question and found three main resources:

  1. The Current Population Survey (CPS) from the US Census Bureau. The CPS reports education, demographic, geographic, and employment data for about 60,000 individuals each month. Each participant responds for four consecutive months, then waits for eight months without responding, then does one final set of four consecutive months of responses.
  2. The County Business Patterns (CBP), also from the Census Bureau. The CBP reports industry composition by county each year. The data include the number of establishments in each industry as well as bins of total employment in that industry, meaning the number of employees is reported in groups like “1 to 5”, “5 to 10”, “10 to 20”, etc.
  3. The IRS migration data. This dataset uses the address and reported income on individual tax filings to track how many individuals move in or out of a county. For inflows, the source county is reported. For outflows, the destination county is reported. Each observation is a year, county in, county out, number of filings, total number of exemptions in those filings, and adjusted gross income of the cohort. The IRS claims the data represents 98% of all filings from 1990 to 2014.
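
The CPS rotation pattern in item 1 (four months in, eight months out, four months in) can be sketched in plain Python. The function name `cps_interview_months` and the month-offset convention (0 is the first interview month) are assumptions made for illustration, not part of any Census Bureau tooling.

```python
def cps_interview_months(in_months=4, out_months=8):
    """Return month offsets (0 = first interview) at which a respondent is surveyed.

    Follows the pattern described above: `in_months` consecutive interviews,
    a gap of `out_months`, then `in_months` more interviews.
    """
    first_spell = list(range(in_months))
    second_start = in_months + out_months
    second_spell = list(range(second_start, second_start + in_months))
    return first_spell + second_spell


months = cps_interview_months()
print(months)  # [0, 1, 2, 3, 12, 13, 14, 15]
```

Under this convention a respondent is observed at offsets 0 to 3 and 12 to 15, so the same person can be compared exactly one year apart in up to four month pairs.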

3 Strategies

I came up with three strategies for answering the question: use the CPS; merge the CBP with the IRS data to track industry and geographic movement; or use only the IRS data to track geographic movement.

I wanted to start with the CPS because I really care about how likely an individual is to move and this is the only micro (individual-level) data set I had. However, I had a few issues:

Then I thought I could merge the CBP and the IRS data: the CBP would help me track industry mobility and the IRS would let me track geographic mobility. It turns out that merging two aggregate data sets is difficult. I didn’t spend too much time on it, but I couldn’t figure out how to map the IRS inflow/outflow data, together with the previous-year and current-year industry employment bins, to the number of people who either moved or changed industries. Perhaps if the CBP reported exact counts instead of bins I could have done it, but it wasn’t obvious to me how to work with the bins.
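
To see why binned counts are hard to work with, consider what two consecutive years of bins reveal about the change in employment. The sketch below treats each bin as a closed interval and computes the widest possible range for the year-over-year change. The helper `change_bounds` and the example bins are hypothetical, chosen only to mirror the bin labels mentioned above.

```python
def change_bounds(bin_prev, bin_curr):
    """Bounds on (current - previous) employment when only bins are known.

    Each bin is a (low, high) tuple of inclusive employee counts.
    """
    lo_prev, hi_prev = bin_prev
    lo_curr, hi_curr = bin_curr
    return (lo_curr - hi_prev, hi_curr - lo_prev)


# Same bin in both years: employment may have fallen or risen by up to 10.
print(change_bounds((10, 20), (10, 20)))  # (-10, 10)

# Adjacent bins: the change could be anywhere from 0 to 15 workers.
print(change_bounds((5, 10), (10, 20)))   # (0, 15)
```

Even the net change is only known up to a wide interval, and a net change by itself says nothing about how many people left versus arrived.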

I eventually fell back on the third option, which was to just use the IRS data and forget about changing industries. This is equivalent to assuming that a worker will never switch industries without also switching the county of primary residence. While not exactly what I was going for, this doesn’t seem like too strong an assumption. Because industry-only switches are missed, the final statistic I report will underestimate the total fraction of people who “move” under my definition above.
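
The direction of the bias follows from the definition: everyone who changes county is a mover, but some movers change only industry. A toy example with made-up workers (the flags below are invented for illustration, not drawn from any dataset):

```python
# Each tuple is (changed_county, changed_industry) for one hypothetical worker.
workers = [
    (True, False),   # moved county, same industry
    (True, True),    # moved county and industry
    (False, True),   # changed industry only: missed by county-level data
    (False, False),  # did not move
    (False, False),  # did not move
]

geographic_rate = sum(county for county, _ in workers) / len(workers)
full_rate = sum(county or industry for county, industry in workers) / len(workers)

print(geographic_rate, full_rate)  # 0.4 0.6
```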

Using the IRS data

I wrote some Python code (forthcoming as of 2017-05-23; once I find the right home for it I will release it) that downloads and cleans all the IRS migration data. The code takes about 3 minutes to run and produces two files, inflows.feather and outflows.feather. These files store the data in Feather, a binary format meant to be an efficient, cross-language way of serializing tabular data.

The end result of my analysis should be a single number representing the fraction of people who move each year. In order to compute this I used the code below.

Breaking form from the previous posts on this site, I will interleave code and output so that it is clear what is in the data.

# import libraries
import feather
import pandas as pd

# read in outflows data
outflows = feather.read_dataframe("outflows.feather")
outflows.head()

   state_from  county_from  state_to  county_to  returns  exemptions    agi  year1
0           2           13        96          0       84         173    NaN   1990
1           2           13         2         20       12          24  14.29   1990
2           2           13        58          0       21          45  25.00   1990
3           2           13        59          0       41          88  48.81   1990
4           2           13        59          9       10          16  11.90   1990

fips_cz = feather.read_dataframe("fips2cz.feather")
fips_cz[["FIPS", "County Name"]].head()

   FIPS     County Name
0  1001  Autauga County
1  1003  Baldwin County
2  1005  Barbour County
3  1007     Bibb County
4  1009   Blount County

# Want to change the fips column into state and county code
# pad the FIPS column with leading 0's to make sure it is 5 digits.
# then the first two are the state and last three are the county
fips_code = fips_cz.FIPS.astype(str).str.zfill(5)
fips_cz["state"] = fips_code.str[:2].astype(int)
fips_cz["county"] = fips_code.str[2:].astype(int)

# also get a list of states so we can filter the inflows and outflows
# dataframes
states = fips_cz["state"].unique()
fips_cz[["FIPS", "County Name", "state", "county"]].head()

   FIPS     County Name  state  county
0  1001  Autauga County      1       1
1  1003  Baldwin County      1       3
2  1005  Barbour County      1       5
3  1007     Bibb County      1       7
4  1009   Blount County      1       9
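
The same zero-pad-and-slice logic for a single code, as a plain-Python sketch (the helper name `split_fips` is hypothetical):

```python
def split_fips(fips):
    """Split a county FIPS code into (state, county) integer codes."""
    code = str(fips).zfill(5)  # restore any leading zero dropped by integer storage
    return int(code[:2]), int(code[2:])


print(split_fips(1001))   # (1, 1), matching Autauga County in the table
print(split_fips(48201))  # (48, 201)
```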

We need to make two adjustments to the raw outflows DataFrame:

  1. We need to restrict the state_from column to be one of the US states so we don’t count things like the US itself, foreign countries, etc.
  2. Sometimes the returns column is a negative number. This is a flag denoting that the entry was suppressed for that column. The reason for suppressing data is to preserve the confidentiality of individuals. The documentation says the following:

At the county level only, certain matched tax returns that represented a specified percentage of the total of any particular cell have been excluded. For example, if one return represented 75 percent of the value of a given cell, the return was suppressed from the county detail. The actual threshold percentage used cannot be released.

# restrict outflows to be flows from a state
out_states = outflows[outflows.state_from.isin(states)]

# keep only positive counts (negative values flag suppressed entries)
out_states = out_states[(out_states.returns > 0) & (out_states.exemptions > 0)]
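
A toy illustration of the two filters, written in plain Python so the effect on each row is easy to trace. The rows and the flag value `-1` are made up; the text above says only that suppressed entries are negative.

```python
# Hypothetical rows: (state_from, returns, exemptions)
rows = [
    (2, 84, 173),    # ordinary county-level row: kept
    (2, -1, -1),     # suppressed entry (negative flag): dropped
    (96, 500, 900),  # state_from is not a real state code: dropped
    (2, 0, 0),       # zero counts also fail the strict "> 0" test: dropped
]
valid_states = {1, 2, 4}  # stand-in for the `states` array built earlier

kept = [
    row for row in rows
    if row[0] in valid_states and row[1] > 0 and row[2] > 0
]
print(kept)  # [(2, 84, 173)]
```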

Now, this dataset has aggregated rows representing total outflow from one county to all destinations. These are coded as having state code 96 and county code 0. The dataset also includes observations where the from and to state/county codes are the same. This represents the pool of people who didn’t move from that county in a given year. For all the starting counties and years, these are the only two rows we need to compute our desired statistic.

Our strategy will be as follows:

Here’s the code

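# columns used below (assumed: the year plus the two count columns)
cols = ["year1", "returns", "exemptions"]

# rows coded (96, 0) hold the total outflow from a county to all destinations
all_mover_out_query = "state_to == 96 and county_to == 0"
gb_out = out_states.groupby(["state_from", "county_from"])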

res_out = []
for name, group in gb_out:
    st, co = name
    all_movers = group.query(all_mover_out_query)[cols].set_index("year1")
    non_mover_query = f"state_to == {st} and county_to == {co}"
    non_movers = group.query(non_mover_query)[cols].set_index("year1")
    non_movers = non_movers.add_prefix("stay_")
    how_many = pd.concat([all_movers, non_movers], axis=1)
    how_many.reset_index(inplace=True)
    how_many["state"] = st
    how_many["county"] = co
    res_out.append(how_many)

final_out = pd.concat(res_out)
final_out.set_index(["state", "county", "year1"], inplace=True)
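
The same two rows per county and year can also be selected without an explicit loop. The following is a sketch of one alternative, not the code used for the results reported here. It runs on a tiny made-up DataFrame whose column names match the outflows data shown earlier; the counts are invented.

```python
import pandas as pd

# Made-up outflow rows for a single origin county (state 2, county 13).
toy = pd.DataFrame({
    "state_from":  [2, 2, 2, 2],
    "county_from": [13, 13, 13, 13],
    "state_to":    [96, 2, 96, 2],
    "county_to":   [0, 13, 0, 13],
    "returns":     [80, 920, 90, 910],
    "exemptions":  [170, 2030, 180, 2020],
    "year1":       [1990, 1990, 1991, 1991],
})

keys = ["state_from", "county_from", "year1"]
counts = ["returns", "exemptions"]

# Total-outflow rows are coded as destination (96, 0).
is_total = (toy["state_to"] == 96) & (toy["county_to"] == 0)
# Non-mover rows have identical origin and destination codes.
is_stay = (toy["state_to"] == toy["state_from"]) & (toy["county_to"] == toy["county_from"])

movers = toy[is_total].set_index(keys)[counts]
stayers = toy[is_stay].set_index(keys)[counts].add_prefix("stay_")

# Outer join keeps county-years missing one of the two rows (as NaN).
combined = movers.join(stayers, how="outer")
print(combined)
```

The boolean masks replace the per-group `query` calls, and the join on the (state, county, year) index plays the role of the `pd.concat(..., axis=1)` step.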

Now we can compute the main statistic we are after.

# some years didn't have observations -- drop them
clean_out = final_out.dropna()

# compute the statistic using the "returns" column
n1 = clean_out["returns"].sum() / clean_out.eval("stay_returns + returns").sum()

# repeat with the "exemptions" column
n2 = clean_out["exemptions"].sum() / clean_out.eval("exemptions + stay_exemptions").sum()

In the end, the two numbers are 6.59% using the returns data and 5.74% using the exemptions data.
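
Note that the statistic pools all counties and years before dividing, so large counties carry more weight than they would in a simple average of county-level rates. A small sketch with invented numbers shows the difference:

```python
# Hypothetical (movers, stayers) counts for two county-year observations.
observations = [
    (10, 90),     # small county: 10% moved
    (400, 9600),  # large county: 4% moved
]

total_movers = sum(m for m, _ in observations)
total_people = sum(m + s for m, s in observations)
pooled_rate = total_movers / total_people

mean_of_rates = sum(m / (m + s) for m, s in observations) / len(observations)

print(round(pooled_rate, 4))    # 0.0406
print(round(mean_of_rates, 4))  # 0.07
```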

Other numbers reported in the literature on labor mobility are:

References

Coen-Pirani, D. (2010). Understanding gross worker flows across U.S. states. Journal of Monetary Economics, 57(7), 769–784. http://doi.org/10.1016/j.jmoneco.2010.08.001

Kambourov, G., & Manovskii, I. (2008). Occupational Mobility and Wage Inequality, Second Version. SSRN Electronic Journal. http://economics.sas.upenn.edu/~manovski/papers/occ_mob_and_wage_ineq.pdf