Hi, I’m trying to merge 2 shapefiles from the year 1850 (from NHGIS). One has agricultural data and the other has demographic data. Both are easily read with read_nhgis_sf, but when I try to merge them into a single file I run into problems. I tried st_join from the sf package to merge the 2 shapefiles into one, but the result is a colossal file of 10k rows, with values quadruplicated or more (the original files only had around 1650 rows).
The closest I’ve come to a solution is to read the datasets individually with read_nhgis, merge those datasets with base R merge, and then join the resulting dataset to a shapefile with ipums_shape_inner_join. This gives me an acceptable shapefile, but all the metadata and variable descriptions are lost. With 60 different variables, being able to read the descriptions straight from RStudio is really handy compared to opening the codebook to see which one is which. Is there any way to keep the metadata in the shapefile, or maybe add it back later? Thank you.
Data from NHGIS comes in three different formats: Source Tables (census data that have been aggregated to various geographic levels), Time Series Tables, and GIS Files (shapefiles based on TIGER/Line files at various geographic levels). I’m guessing that you are trying to join two source tables (one containing agricultural data and another containing demographic data) to a single shapefile; please correct me if I’m wrong. The read_nhgis_sf function completes the join for you once you supply both the data_file (source table) and shape_file (shapefile) arguments (see the example below); these exercises, Ex 1 and Ex 2, will help you practice working with NHGIS data in R. Alternatively, if you are working with the data in a GIS platform such as ESRI ArcGIS, source tables can be joined with GIS files using the GISJOIN variable (this tutorial outlines how to do this). If you provide more specific information on the tables and shapefile you are having difficulty joining, I will be happy to offer more assistance.
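To illustrate, a call along these lines should complete the join in one step (the file names below are placeholders for your own extract downloads):

library(ipumsr)

# read_nhgis_sf() reads the source table and the shapefile and joins them;
# replace the placeholder file names with the paths to your own NHGIS extract files
data1850 <- read_nhgis_sf(data_file  = "nhgis0001_csv.zip",
                          shape_file = "nhgis0001_shape.zip")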
Hi, sorry for the missing info. I’m actually using 2 kinds of source tables. One is population data:
Dataset: 1850 Census: Population, Agriculture & Other Data [US, States & Counties]
NHGIS code: 1850_cPAX
NHGIS ID: ds10
(the one used in the first exercise), and the other is agricultural data:
Year: 1850
Geographic level: County (by State)
Dataset: 1850 Census: Agriculture Data [US, States & Counties]
NHGIS code: 1850_cAg
NHGIS ID: ds9
What I have so far is one line of code for reading the agriculture source table (ds9) through data_layer, like this: csv1850a <- read_nhgis(data_file = pathcsv, data_layer = contains("ds9")), and another for reading the population data (ds10), like this: csv1850b <- read_nhgis(data_file = pathcsv, data_layer = contains("ds10")). Then I use base R merge to combine them into a single 1850 dataset, like this: csv1850 <- merge(csv1850a, csv1850b), and then I join that to the GIS file like this: data1850 <- ipums_shape_inner_join(data = csv1850, shape_data = shape1850, by = "GISJOIN"). This works, and eventually I get a shapefile in which I have both the GIS data and the 2 source tables together, but the problem is that the descriptions of the variables are missing.
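Putting it all together, my current pipeline looks roughly like this (the paths are placeholder extract names, and the shape-reading line is just a sketch of how I load shape1850):

library(ipumsr)
library(dplyr)  # contains() is re-exported by dplyr

pathcsv <- "nhgis0001_csv.zip"    # placeholder: extract with both 1850 source tables
pathshp <- "nhgis0001_shape.zip"  # placeholder: extract with the 1850 county shapefile

# read the two source tables from the same extract
csv1850a <- read_nhgis(data_file = pathcsv, data_layer = contains("ds9"))
csv1850b <- read_nhgis(data_file = pathcsv, data_layer = contains("ds10"))

# base R merge of the two tables (this is where the descriptions get lost)
csv1850 <- merge(csv1850a, csv1850b)

# I read the 1850 county shapefile separately, roughly like this:
shape1850 <- read_ipums_sf(shape_file = pathshp, shape_layer = contains("1850"))

# join the merged tables to the shapefile
data1850 <- ipums_shape_inner_join(data = csv1850,
                                   shape_data = shape1850,
                                   by = "GISJOIN")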
I’m trying to do what you’re suggesting with read_nhgis_sf, with this code: bisdata1850 <- read_nhgis_sf(data_file = pathcsv, data_layer = contains("ds9"), data_layer = contains("ds10"), shape_file = pathshp, shape_layer = contains("1850")), but it tells me that data_layer has been given multiple arguments: Error in read_nhgis_sf(data_file = pathcsv, data_layer = contains("ds9"), : formal argument "data_layer" matched by multiple actual arguments (the original message is in Spanish; sorry, I can’t figure out how to set the R error messages to English). I don’t know what else to do. I can settle for data without variable descriptions, but maybe there’s a way to combine all the source tables while keeping the descriptions and still being able to join them to a GIS shapefile. Thank you for your answer, and I hope there is enough info.
It sounds like you are trying to preserve, in the variable names after joining the data to the shapefile, the second row of the CSV (first screenshot below), which is added to the spreadsheet by selecting “Include additional descriptive header row” in the Review and Submit section of the data extract process (second screenshot below); please correct me if I am wrong. Joins in GIS software can only be done with a single header row due to the nature of how the software completes the join, so it is not possible to preserve the second header. The best workaround might be to revise the first header row in the CSV to be more descriptive before joining the tables to the shapefile. If this is not the problem you are describing, can you please provide a screenshot of your data output and describe in more detail what is missing?
Hello again, sorry for the delayed responses. My problem is more about R (or RStudio, which is the IDE I’m using), because I’m trying to polish my R skills more than start learning GIS or Excel. My problem isn’t that I want to preserve a second row of the CSV; in fact the CSV I’m using doesn’t have a descriptive second row, because with the “magic” of ipumsr you can import NHGIS data really easily and it automatically gives you the descriptive variable names. The thing is that while I can load a single source table just fine with all the descriptive names, and also a shapefile joined with that single source table, when I merge 2 source tables from the same year those descriptive names disappear. I haven’t found any way to merge 2 source tables through ipumsr that keeps the descriptive names, so I just use base R to do the merge. When I then join the merged dataset to a shapefile with an ipumsr function, I don’t regain those descriptive names either. If I try it the other way around, joining 2 shape+source-table objects together (using sf functions, because I can’t find any ipumsr function to do that), I get this colossal sf object with many repeated values that at least keeps the descriptive names. So honestly I’m a bit lost on what to do, because I don’t have much experience with spatial data. I wouldn’t mind using just the variable names without descriptions, though it’s a bit annoying to remember that, for example, “ADL001” means “value of livestock”. Thank you again for your answer, and I hope I shared all the info you needed.
Edit: sorry, the webpage is telling me that as a new user I can only attach a single image, so if you want more I’ll send them later.
Thank you for providing additional information; I think I understand the problem now. As you stated, the variable descriptions are lost when you execute the merge() function. The left_join() function in the dplyr package will retain the variable descriptions in R; be sure to include the argument by = "GISJOIN".
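For example, something along these lines with the objects from your earlier posts (the check at the end is just one way to confirm a description survived):

library(dplyr)
library(ipumsr)

# join the two source tables; left_join() keeps the variable descriptions
csv1850 <- left_join(csv1850a, csv1850b, by = "GISJOIN")

# e.g. check that a description is still attached to a variable
ipums_var_desc(csv1850, ADL001)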
Hello, that worked indeed, and it kept the variable descriptions in the source table, but when I try to use ipums_shape_inner_join() (or any of the shape join family) I get the error Error: All columns in a tibble must be vectors. x Column geometry is a sfc_MULTIPOLYGON/sfc object. Honestly I don’t know what else to do, because I tried converting the merged source file into a data frame with as.data.frame(csv1850) and I keep getting the same error. Thank you for your answer; I hope there’s a solution for this problem too.
I’m glad the join worked for you and kept the variable descriptions in the table. I think the error message you are getting could be related to either not having the necessary library loaded or having an older version of the package installed from CRAN (see this Stack Overflow thread). Try running library(sf) to see if that fixes the problem. If that doesn’t work, a good way to troubleshoot errors when programming is to copy and paste the error message into Google; usually you will find that someone else has had the same problem and posted a solution.
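In other words, something like this with the objects from your earlier posts:

# loading sf registers the methods R needs to handle the geometry column,
# which should resolve the "All columns in a tibble must be vectors" error
library(sf)

data1850 <- ipums_shape_inner_join(data = csv1850,
                                   shape_data = shape1850,
                                   by = "GISJOIN")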
Ohhh, it finally worked! It seems that I needed to load sf. Thanks a lot for your help. Now, just out of curiosity, is there any reason why it works using tidyverse functions but not with base R? Maybe the descriptive names are only supported by the tidyverse?
That’s a great question! I can’t be sure, but it might be possible to view the descriptive names of variables using base R; I just found a way to do it using a tidyverse function. It may have been easier this way because tidyverse functions work with data frames as tibbles, a modern version of the data frame that allows for non-standard variable names. This vignette, which compares dplyr functions to their base R equivalents, might shed some light on the differences between working with data in the tidyverse and in base R.
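For what it’s worth, the rough correspondence from that vignette, applied to the objects in this thread, looks like this:

library(dplyr)

# roughly equivalent left joins of the two 1850 source tables;
# in your case, the dplyr version is the one that kept the variable descriptions
joined_dplyr <- left_join(csv1850a, csv1850b, by = "GISJOIN")
joined_base  <- merge(csv1850a, csv1850b, by = "GISJOIN", all.x = TRUE)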