Compilation Methods

Compilation of Acreage Data from the US Farm Service Agency

library(tidyverse)
library(readxl)

This vignette documents the compilation of the Farm Service Agency data in to an R package. The goal is to maintain the data integrity, enhance comparability of the data over time, and reduce file size and read speed. Additional data validation and cleaning documented in the next vignette.

source("../R/track_dataset.R")

The early years, 2009 and 2010

Irrigation practice information is not included in the FSA data for these years. Also, some fields were labelled inconsistently. Columns “Irrigation Practice” and “State County Code” are created here for forward compatibility. In these tables, it appears that there is a row for each intended use of each crop on each farm. There are then columns for the planted acres, prevented acres, and failed acres. The tables are pivoted so that there is a row for each type (planted, prevented, and failed) of acreage for each intended use of each crop on each farm. Then, rows with acreage values of 0 are removed.

A Transition Year, 2011

The FSA acreage data for 2011 includes irrigation practice, but is formatted differently from the subsequent years. Now, in addition to acreage types of Planted, Prevented, and Failed, there are additional types of Volunteer, Plant and Fail, and Not Planted. Further, each of these types is either irrigated or not, so there are many types of acreage. Also, there are columns for total acreage values: Planted Total, Prevented Total, Failed Total, Volunteer Total, Not Planted Total, and Plant and Fail Total. The values in the total columns are equal to the sum of the irrigated and nonirrigated acreage for each type.

The type columns are pivoted to rows, and the zero values are removed. The nonirrigated acreage values are unnecessary (because the irrigated acreage and the total acreage provide that information).

State Names and Abbreviations

There is some variation in the spelling or capitalization of the State names and abbreviations between the 2009-2010 data and the 2011 data. Fortunately, there are numeric codes for the states, and so the codes are used to standardize the State names and abbreviations.

#### Create State_table to help clean-up
## variable spellings of states or abbreviations
State_table <-
  full_join(
    select(fsa11b, `State Code`, `State Name`) |> unique(),
    select(fsa0, `State Code`, `State Abbrev`) |> unique()) |>
  mutate(`State Name`=if_else(`State Abbrev`=='MP', 'MARIANA ISLANDS', `State Name`))

fsa1 <- bind_rows(fsa0, fsa11b) |>
  select(-`State Name`, -`State Abbrev`) |>
  left_join(State_table) |> track_dataset('fsa1')
## 542,694 obs. of 15 variables. 65,605,208 bytes

# all.equal(fsa1[,c('State Code', 'State Name', 'State Abbrev')] |> unique(),
#           State_table) ## TRUE
## This checks out...

rm(fsa0, fsa11, fsa11b)

2012-present: consistent files

The remaining files are more uniform.

read_fsa <- function(x) {
  x <- paste0('../data-raw/downloaded excel files/', x)
  read_xlsx(x, sheet='county_data') |>
    select(-`Planted and Failed Acres`) |>
    filter(!is.na(County))}

fsa2 <- bind_rows(
  `2012`=read_fsa('2012_fsa_acres_jan_2013.xlsx') |>
    rename(`Failed Acres`='Failded Acres') |> track_dataset('2012'),
  `2013`=read_fsa('2013_fsa_acres_jan_2014.xlsx') |> track_dataset('2013'),
  `2014`=read_fsa('2014_fsa_acres_jan2014.xlsx') |> track_dataset('2014'),
  `2015`=read_fsa('2015_fsa_acres_01052016.xlsx') |> track_dataset('2015'),
  `2016`=read_fsa('2016_fsa_acres_010417.xlsx') |> track_dataset('2016'),
  `2017`=read_fsa('2017_fsa_acres_010418.xlsx') |> track_dataset('2017'),
  `2018`=read_fsa('2018_fsa_acres_012819.xlsx') |> track_dataset('2018'),
  `2019`=read_fsa('2019_fsa_acres_web_010220.xlsx') |> track_dataset('2019'),
  `2020`=read_fsa('2020_fsa_acres_web_010521.xlsx') |> track_dataset('2020'),
  `2021`=read_fsa('2021_fsa_acres_web_010322.xlsx') |> 
    track_dataset('2021') |>
    slice(-1) |>
    mutate(`Planted Acres` = as.numeric(`Planted Acres`),
           `Volunteer Acres` = as.numeric(`Volunteer Acres`),
           `Failed Acres` = as.numeric(`Failed Acres`),
           `Prevented Acres` = as.numeric(`Prevented Acres`),
           `Not Planted Acres` = as.numeric(`Not Planted Acres`)),
  `2022`=read_fsa('2022_fsa_acres_web_010323.xlsx') |> track_dataset('2022') |>
    dplyr::mutate(`State County Code` = as.character(`State County Code`)),
  `2023`=read_fsa('2023_fsa_acres_web_010224.xlsx') |> track_dataset('2023') |>
    dplyr::mutate(`State County Code` = as.character(`State County Code`)),
  .id='Year')  |> 
  track_dataset('fsa2012-2023') 
## 1,790,332 obs of 16 variables. 158,999,384 bytes

## calculate Total acres, make tidy
fsa2b <- fsa2 |>
  gather('Type', 'Acres', `Planted Acres`:`Not Planted Acres`) |> ## 6,189,540 obs. of 13 variables. 644,259,288 bytes
  track_dataset('fsa2 gather acreage types') |>
  group_by(`State Code`, County, `County Code`, `State County Code`, Year,
                  `Crop`, `Crop Type`, `Crop Code`, `Intended Use`, `Irrigation Practice`, `Type`) |>
  summarise(Acres=sum(Acres, na.rm=TRUE)) |>
  ungroup() |>
  track_dataset('fsa2 sum acres by county, crop, use, irrigation, and type') |>
  spread(`Irrigation Practice`, Acres) |>
  mutate(Type = gsub("\\s*\\w*$", "", Type),
         I= if_else(is.na(I), 0, I),
         N= if_else(is.na(N), 0, N),
         O= if_else(is.na(O), 0, O)) |>
  mutate(`T`=I+N+O) |> select(-N) |>
  track_dataset('fsa2 spread irrigation practice, calculate total, remove unirrigated') |>
  gather(`Irrigation Practice`, 'Acres', `I`:`T`) |>
  track_dataset('fsa2 gather irrigation practice') |>
  filter(Acres != 0 & !is.na(Acres)) |> ## 1522162
  track_dataset("fsa2 remove 0's and missing data") |>
  rename(`County Name`=County, `Crop Type Name`='Crop Type', `Crop Name`=Crop)

State_table[55,] <- list("60","AMERICAN SAMOA", "AS")
fsa2b <- inner_join(fsa2b, State_table) |> track_dataset('fsa2 state join')

The types of acreage values, such as Planted Acres, Failed Acres, etc., which are listed in separate columns in the raw tables, are separated in to distinct rows. Then the acreage values are summed for each year, county, crop, crop type, intended use, irrigation practice, and type. Prior to that operation, each row represented a specific (though unnamed) farm, now the rows represent the totals for the county. The rows for nonirrigated acreage are removed, because that information is implicit in the values for irrigated and total acreage.

fsa <- bind_rows(fsa1, fsa2b) |> track_dataset('fsa')
saveRDS(fsa, '../data-raw/acresFSA_compilation.rds')
rm(fsa1, fsa2, fsa2b)

write.csv(data_manipulation_log, '../data-raw/data_compilation_log.csv')

### TODO: Read in farm counts
# read_farm_count <- function(x) {read_xlsx(x, sheet='farm_count')}

C A Pellett

2024-12-17

Compilation of Acreage Data from the US Farm Service Agency

The early years, 2009 and 2010

A Transition Year, 2011

State Names and Abbreviations

2012-present: consistent files