Worker Flows

For a research project I wanted to answer the question “What percentage of people move each year?” where “move” is defined as changing industry or changing location. I spent some time looking for data to help me answer the question and found three main resources:

The Current Population Survey (CPS) from the US Census Bureau. The CPS reports education, demographic, geographic, and employment data for about 60,000 individuals each month. Each participant responds for four consecutive months, then waits for eight months without responding, then does one final set of four consecutive months of responses.
The County Business Patterns (CBP), also from the Census Bureau. The CBP reports industry composition information by county each year. The data include the number of establishments in each industry as well as bins of total employment in that industry, meaning the number of employees is reported in groups like “1 to 5”, “5 to 10”, “10 to 20”, etc.
The IRS migration data. This dataset uses the address and reported income on individual tax filings to track how many individuals move in or out of a county. For inflows, the source county is reported. For outflows, the destination county is reported. Each observation is a year, county in, county out, number of filings, total number of exemptions in those filings, and adjusted gross income of the cohort. The IRS claims the data represents 98% of all filings from 1990 to 2014.
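
To make the CPS rotation pattern concrete, here is a minimal sketch that lists the months in which a single participant responds. The function name `cps_response_months` and the month-offset convention (0 is the first interview month) are made up for illustration and are not part of any Census tooling.

```python
def cps_response_months(first_month=0):
    """Return the month offsets at which a CPS participant responds.

    Pattern described above: 4 months in, 8 months out, 4 months in.
    """
    first_block = [first_month + i for i in range(4)]
    # the second block starts after 4 response months plus 8 months off
    second_block = [first_month + 12 + i for i in range(4)]
    return first_block + second_block


months = cps_response_months()
print(months)  # [0, 1, 2, 3, 12, 13, 14, 15]
```

Each individual contributes at most eight monthly observations, and the second block falls exactly one year after the first, so the same person can be compared year over year.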

3 Strategies

I came up with three strategies for fulfilling the want operator (answering the question): use the CPS; merge the CBP with the IRS data to track both industry and geographic movement; or just use the IRS data to track geographic movement.

I wanted to start with the CPS because I really care about how likely an individual is to move and this is the only micro (individual-level) data set I had. However, I had a few issues:

The sample size is not that large.
The industry reported by the CPS is very noisy. In the sample I looked at (all the data from 1990-2016, collected from IPUMS), the industry column was missing for more than half of the observations.

Then I thought I could somehow merge the CBP and the IRS data, thinking that the CBP would help me track industry mobility and the IRS would let me track geographic mobility. It turns out that merging two aggregate data sets is difficult. I didn’t spend too much time on it, but I couldn’t figure out how to map the inflow/outflow data from the IRS, together with the previous-year/current-year industry employment bins, to the number of people who either moved or changed industries. Perhaps if the CBP reported numbers instead of bins I could have done it, but it wasn’t obvious to me how to work with the bins.
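
To illustrate why the bins are awkward, the sketch below computes the range of year-over-year employment changes that is consistent with two binned observations. The bin edges and the helper `change_bounds` are hypothetical, chosen only to mirror the “5 to 10” style of reporting described above.

```python
def change_bounds(bin_prev, bin_curr):
    """Bounds on the change in employment given two (low, high) bins."""
    lo_prev, hi_prev = bin_prev
    lo_curr, hi_curr = bin_curr
    # smallest change: top of last year's bin to bottom of this year's bin
    smallest = lo_curr - hi_prev
    # largest change: bottom of last year's bin to top of this year's bin
    largest = hi_curr - lo_prev
    return smallest, largest


# an industry in the "5 to 10" bin last year and the "10 to 20" bin this year
print(change_bounds((5, 10), (10, 20)))  # (0, 15)
# staying in the same bin is consistent with both growth and decline
print(change_bounds((5, 10), (5, 10)))   # (-5, 5)
```

Even the net change is only pinned down to an interval, and a net change says nothing about which workers arrived or left.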

I eventually fell back on the third option, which was to just use the IRS data and forget about changing industries. This is equivalent to assuming that a worker will never switch industries without also switching the county of primary residence. While not exactly what I was going for, this doesn’t seem like too strong of an assumption. The final statistic I report will under-estimate the total fraction of people who “move” under my definition from above.

Using the IRS data

I wrote some Python code (forthcoming as of 2017-05-23; once I find the right home for it I will release it) that downloads and cleans all the IRS migration data. The code takes about 3 minutes to run and produces two files: inflows.feather and outflows.feather. These files store the data in feather, a binary format meant to be an efficient, cross-language way of serializing tabular data.

The end result of my analysis should be a single number representing the fraction of people who move each year. In order to compute this I used the code below.

Breaking form from the previous posts on this site, I will interleave code and output so that it is clear what is in the data.

# import libraries
import feather
import pandas as pd

# read in outflows data
outflows = feather.read_dataframe("outflows.feather")
outflows.head()

	state_from	county_from	state_to	county_to	returns	exemptions	agi	year1
0	2	13	96	0	84	173	NaN	1990
1	2	13	2	20	12	24	14.29	1990
2	2	13	58	0	21	45	25.00	1990
3	2	13	59	0	41	88	48.81	1990
4	2	13	59	9	10	16	11.90	1990

fips_cz = feather.read_dataframe("fips2cz.feather")
fips_cz[["FIPS", "County Name"]].head()

	FIPS	County Name
0	1001	Autauga County
1	1003	Baldwin County
2	1005	Barbour County
3	1007	Bibb County
4	1009	Blount County

# Want to change the fips column into state and county code
# pad the FIPS column with leading 0's to make sure it is 5 digits.
# then the first two are the state and last three are the county
fips_code = fips_cz.FIPS.astype(str).str.zfill(5)
fips_cz["state"] = fips_code.str[:2].astype(int)
fips_cz["county"] = fips_code.str[2:].astype(int)

# also get a list of states so we can filter the inflows and outflows
# dataframes
states = fips_cz["state"].unique()
fips_cz[["FIPS", "County Name", "state", "county"]].head()

	FIPS	County Name	state	county
0	1001	Autauga County	1	1
1	1003	Baldwin County	1	3
2	1005	Barbour County	1	5
3	1007	Bibb County	1	7
4	1009	Blount County	1	9
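
As a quick check of the padding logic, here is the same transformation applied to a few hand-picked codes. The toy DataFrame is made up for illustration. Without the `zfill` step, slicing the string "1001" would give state 10 and county 1 instead of state 1 and county 1.

```python
import pandas as pd

# toy FIPS codes stored as integers, so any leading zero has been lost
toy = pd.DataFrame({"FIPS": [1001, 6037, 36061]})

# pad to 5 characters: "01001", "06037", "36061"
padded = toy["FIPS"].astype(str).str.zfill(5)
toy["state"] = padded.str[:2].astype(int)   # first two digits
toy["county"] = padded.str[2:].astype(int)  # last three digits

print(toy)
```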

We need to make two adjustments to the raw outflows DataFrame:

We need to restrict the state_from column to be one of the US states so we don’t count things like the US itself, foreign countries, etc.
Sometimes the returns column is a negative number. This is a flag denoting that the entry was suppressed for that column. The reason for suppressing data is to preserve the confidentiality of individuals. The documentation says the following:

At the county level only, certain matched tax returns that represented a specified percentage of the total of any particular cell have been excluded. For example, if one return represented 75 percent of the value of a given cell, the return was suppressed from the county detail. The actual threshold percentage used cannot be released.

# restrict outflows to be flows from a state
out_states = outflows[outflows.state_from.isin(states)]

# filter out negative numbers
out_states = out_states[(out_states.returns > 0) & (out_states.exemptions > 0)]
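
To see what the two filters drop, here is a minimal sketch on a made-up table. The values are invented, `valid_states` stands in for the `states` array built earlier, and the use of -1 as the suppression flag is an assumption for illustration (the text above only says the flag is negative).

```python
import pandas as pd

toy_out = pd.DataFrame({
    "state_from": [2, 2, 96, 2],
    "returns":    [84, -1, 50, 12],
    "exemptions": [173, -1, 90, 24],
})
valid_states = [1, 2]  # stand-in for the `states` array

# keep only flows that originate in a real state (drops the code-96 row)
from_states = toy_out[toy_out["state_from"].isin(valid_states)]

# drop rows where either count carries the negative suppression flag
kept = from_states[(from_states["returns"] > 0) & (from_states["exemptions"] > 0)]
print(kept)
```

Only the first and last rows survive: the second is suppressed and the third does not originate in a state.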

Now, this dataset has aggregated rows representing total outflow from one county to all destinations. These are coded as having state code 96 and county code 0. The dataset also includes observations where the from and to state/county codes are the same. This represents the pool of people who didn’t move from that county in a given year. For all the starting counties and years, these are the only two rows we need to compute our desired statistic.

Our strategy will be as follows:

Group by the origin state and county codes (state_from and county_from).
For each group, extract the aggregate outflow rows and the non-mover rows for each year.
Set the index of each DataFrame to be the year, which ensures the data is aligned properly in the next step.
Stitch the two sub-DataFrames together horizontally, so instead of two two-column DataFrames we end up with one four-column DataFrame.
Store this in a list.
Stitch together the list of DataFrames vertically, so we end up with an (N_county * N_year) by 4 DataFrame.

Here’s the code

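# columns needed below: the year (used as the index) plus the two count columns
cols = ["year1", "returns", "exemptions"]

all_mover_out_query = "state_to == 96 and county_to == 0"
gb_out = out_states.groupby(["state_from", "county_from"])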

res_out = []
for name, group in gb_out:
    st, co = name
    all_movers = group.query(all_mover_out_query)[cols].set_index("year1")
    non_mover_query = f"state_to == {st} and county_to == {co}"
    non_movers = group.query(non_mover_query)[cols].set_index("year1")
    non_movers = non_movers.add_prefix("stay_")
    how_many = pd.concat([all_movers, non_movers], axis=1)
    how_many.reset_index(inplace=True)
    how_many["state"] = st
    how_many["county"] = co
    res_out.append(how_many)

final_out = pd.concat(res_out)
final_out.set_index(["state", "county", "year1"], inplace=True)
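
The same table can also be built without an explicit loop. The following is a sketch of one alternative, a join on the origin keys, run on a tiny made-up dataset. The data, the variable names, and the choice of `merge` are illustrative assumptions rather than the approach used for the results reported here.

```python
import pandas as pd

toy = pd.DataFrame({
    "state_from":  [1, 1, 1, 1, 1],
    "county_from": [1, 1, 1, 3, 3],
    "state_to":    [96, 1, 1, 96, 1],
    "county_to":   [0, 1, 3, 0, 3],
    "returns":     [10, 90, 4, 5, 45],
    "exemptions":  [20, 200, 9, 12, 100],
    "year1":       [1990] * 5,
})

keys = ["state_from", "county_from", "year1"]
values = ["returns", "exemptions"]

# rows holding the total outflow from each county (destination code 96/0)
movers = toy.query("state_to == 96 and county_to == 0")[keys + values]

# rows where origin and destination coincide, i.e. the non-movers
same = (toy["state_to"] == toy["state_from"]) & (toy["county_to"] == toy["county_from"])
stayers = toy.loc[same, keys + values].rename(columns={v: "stay_" + v for v in values})

# an outer join keeps county-years present in only one table, filled with NaN
combined = movers.merge(stayers, on=keys, how="outer").set_index(keys)
print(combined)
```

The result has one row per origin county and year, with the four columns `returns`, `exemptions`, `stay_returns`, and `stay_exemptions`.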

Now we can compute the main statistic we are after.

# some years didn't have observations -- drop them
clean_out = final_out.dropna()

# compute the statistic using the "returns" column
n1 = clean_out["returns"].sum() / clean_out.eval("stay_returns + returns").sum()

# repeat with the "exemptions" column
n2 = clean_out["exemptions"].sum() / clean_out.eval("exemptions + stay_exemptions").sum()

In the end, the estimated fraction of people who move each year is 6.59% using the returns data and 5.74% using the exemptions data.
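
Note that the statistic pools all counties and years before dividing, so large counties carry more weight than small ones. A small sketch with made-up counts shows how this differs from averaging county-level rates:

```python
# made-up (movers, stayers) counts for two counties in one year
counties = {"big": (500, 9500), "small": (20, 80)}

# pooled rate: sum the movers, sum everyone, then divide once
total_movers = sum(m for m, s in counties.values())
total_people = sum(m + s for m, s in counties.values())
pooled_rate = total_movers / total_people

# unweighted alternative: average the per-county rates
county_rates = [m / (m + s) for m, s in counties.values()]
unweighted_rate = sum(county_rates) / len(county_rates)

print(f"pooled: {pooled_rate:.4f}, unweighted: {unweighted_rate:.4f}")
```

With these numbers the pooled rate is about 5.1%, while the unweighted average is 12.5% because the small county's high rate counts as much as the big county's.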

Other numbers reported in the literature on labor mobility are:

Coen-Pirani (2010) estimates that 16.3% of workers move between states in a five year period
Kambourov and Manovskii (2008) use the PSID and report that at the 3 digit NAICS level (approximately 400 groups), about 20% of workers switch jobs annually. This number seems quite high to me, but I haven’t taken the time to study the paper carefully enough to know if it is accurate or a feature of their data.
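
For a rough comparison with the annual figures above, the five-year interstate number can be converted to an annual rate. This is a back-of-the-envelope sketch that assumes a constant probability of moving each year, independent across years. That simplification is an assumption made here, not something taken from the paper.

```python
five_year_rate = 0.163  # Coen-Pirani (2010): interstate moves over five years

# if p is the annual probability of moving, then 1 - (1 - p)**5 = five_year_rate
annual_rate = 1 - (1 - five_year_rate) ** (1 / 5)

print(f"implied annual interstate rate: {annual_rate:.2%}")
```

Under that assumption the implied rate is roughly 3.5% per year across state lines, which is lower than the county-level figures above, as one would expect since cross-county moves include moves within a state.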

References

Coen-Pirani, D. (2010). Understanding gross worker flows across U.S. states. Journal of Monetary Economics, 57(7), 769–784. http://doi.org/10.1016/j.jmoneco.2010.08.001

Kambourov, G., & Manovskii, I. (2008). Occupational Mobility and Wage Inequality, Second Version. SSRN Electronic Journal. http://economics.sas.upenn.edu/~manovski/papers/occ_mob_and_wage_ineq.pdf
