This report is a supplement to my PhD thesis on fare accessibility in the transport associations of the Hamburg Region (HVV) and Berlin/Brandenburg (VBB). The full text of the German thesis can be downloaded via DOI:10.15480/882.13161. An English summary of my thesis can be found in the Fare Accessibility Dashboard via DOI:10.15480/882.13164. The report describes results at the stop level. Reports for the other two levels (municipality and 500m grid) can be downloaded via DOI:10.15480/882.13162.
The report presents the estimates of four regression models that were applied to a VBB dataset. The independent variables are listed in the first column of the summary table. The dependent variable is fare accessibility (fare_acc). The aim is to explain the variability of fare_acc based on the independent variables.
After giving an overview of the input variables, this report describes the models and reports diagnostic tests that describe the data quality and support the choice of model: Moran's I to test for spatial autocorrelation, Likelihood Ratio tests to compare goodness of fit, a Breusch-Pagan test for heteroskedasticity, and a Spatial Hausman test for model suitability. Furthermore, 10+1 hypotheses are tested based on the regression estimates.
Please note that this report is a technical supplement to my PhD thesis. For a more abstract interpretation of the results and a comparison of the different spatial resolutions, please refer to the results chapter of my thesis.
The methodology was inspired by Mark Burkey, an economics professor at North Carolina A&T State University. Burkey has produced a number of videos that are very informative for students and researchers of spatial econometrics; they are referenced at several points in this report.
##### PREPARE PLOT INPUT #####
# requires ggplot2, plotly and dplyr (assumed to be loaded in the setup chunk)
# Reorder the levels of the 'group' variable to match the desired plotting order
bplot_import$group <- factor(bplot_import$group,
                             levels = c("fare_acc", "pt_index", "purch_power", "rent",
                                        "travel_time", "pop_dens", "dist_centre",
                                        "cars_cap", "dwelling_area"))
bplot_fill_colors <- c("fare_acc" = "#178072", "pt_index" = "#62988D",
                       "travel_time" = "grey80", "dwelling_area" = "grey80",
                       "cars_cap" = "grey80", "dist_centre" = "grey80",
                       "pop_dens" = "grey80", "purch_power" = "#a4aa4d", "rent" = "#933890")
bplot_fill_alpha <- .5
bplot_outline_size <- c(1, .5, .5, .5, .5, .5, .5, .5, .5)
bplot_median_size <- .2
# define text modules
bplot_xlab <- "Variable"
bplot_ylab <- "Standard deviation"
bplot_dec_sep <- "."
bplot_big_sep <- ","
##### WRITE A FUNCTION FOR THE PLOT #####
create_violin_boxplot <- function(pinput, bplot_fill_alpha, bplot_fill_colors, group_name, bplot_ylab, bplot_median_size) {
  # Check if the specified group_name exists in the dataset
  if (!(group_name %in% unique(pinput$group))) {
    stop("The specified group_name is not present in the dataset.")
  }
  # Calculate sample size for the specified group
  group_sample_size <- pinput %>%
    filter(group == group_name) %>%
    nrow() # Number of rows for the specified group
  # Create the ggplot (note: bplot_median_size is currently not used in the plot body)
  p <- ggplot(pinput, aes(y = sd, x = group, fill = group, color = group)) +
    geom_hline(yintercept = 0, linewidth = .4, linetype = "dotted", color = "grey50") + # Add horizontal line at y = 0
    # Violin plot
    geom_violin(aes(linewidth = ifelse(group == "fare_acc", "thick", "thin")), # Conditional mapping for outline size
                scale = "width", alpha = bplot_fill_alpha) +
    scale_fill_manual(values = bplot_fill_colors, guide = "none") + # "none" replaces the deprecated guide = FALSE
    scale_color_manual(values = bplot_fill_colors, guide = "none") +
    scale_linewidth_manual(values = c(thick = 1, thin = 0.5), guide = "none") + # Disable legend for thick/thin
    # Boxplot
    geom_boxplot(width = 0.1, outlier.size = .1, color = 'black', fill = "black", alpha = 0) +
    # Update x-axis label to include sample size of the specified group
    scale_x_discrete(name = paste0("Variable<br>(n = ", formatC(group_sample_size, big.mark = ","), " ", lvl_en_p, ")")) +
    scale_y_continuous(name = bplot_ylab) +
    # theme options
    theme_light() +
    theme(panel.grid = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1), # Adjust angle and justification of x-axis labels
          axis.ticks.x = element_blank(),
          axis.text = element_text(color = 'black'),
          panel.background = element_rect(fill = "white")) # Set the background color of the plot area to white
  # Convert to an interactive plotly plot
  p_interactive <- ggplotly(p)
  # Disable tooltips for all traces
  p_interactive <- p_interactive %>%
    style(hoverinfo = "none", text = NULL)
  # Return the interactive plotly plot
  return(p_interactive)
}
##### CALCULATE PLOTS FOR DIFFERENT SAMPLES #####
# create a plot that includes outliers
bplot_outliers <- create_violin_boxplot(pinput = bplot_import, bplot_fill_alpha = bplot_fill_alpha,
                                        bplot_fill_colors = bplot_fill_colors, bplot_ylab = bplot_ylab,
                                        bplot_median_size = bplot_median_size, group_name = "pt_index")
# create a plot that is stripped of outliers
bplot <- create_violin_boxplot(pinput = subset(bplot_import, outlier == 0), bplot_fill_alpha = bplot_fill_alpha,
                               bplot_fill_colors = bplot_fill_colors, bplot_ylab = bplot_ylab,
                               bplot_median_size = bplot_median_size, group_name = "pt_index")
# create a plot that is stripped of outliers - URBAN ONLY
bplot_u <- create_violin_boxplot(pinput = subset(bplot_import, outlier_u == 0 & raumt_kl == 1), bplot_fill_alpha = bplot_fill_alpha,
                                 bplot_fill_colors = bplot_fill_colors, bplot_ylab = bplot_ylab,
                                 bplot_median_size = bplot_median_size, group_name = "pt_index")
# create a plot that is stripped of outliers - RURAL ONLY
bplot_r <- create_violin_boxplot(pinput = subset(bplot_import, outlier_r == 0 & raumt_kl == 0), bplot_fill_alpha = bplot_fill_alpha,
                                 bplot_fill_colors = bplot_fill_colors, bplot_ylab = bplot_ylab,
                                 bplot_median_size = bplot_median_size, group_name = "pt_index")
# plots will be called inline below!
As the plot below shows, the input variables contain many outliers.
The plot below excludes outliers beyond ±1.5 IQR for any of the input variables.
The table below shows aggregates of the variables that were used for the (auto-)regressive models. The sample is stripped of outliers (see boxplot above) and contains urban as well as rural stops.
# Create a kableExtra table and set styles for all columns
agg_table_styled <- kable(
agg_table_import,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, bold = TRUE, extra_css = "text-align: right;") # Make first column bold and align right
# Print or use the modified HTML table
print(agg_table_styled)
Variable | Unit | Mean | Median | SD | Min | Max | Source | Year |
---|---|---|---|---|---|---|---|---|
Fare accessibility on a €2.30 budget | 10 Standard deviations | 0.00 | -0.01 | 0.07 | -0.20 | 0.20 | Own calculation, see method section | 12/2018 |
Average per capita purchasing power | €10,000/year | 2.04 | 1.97 | 0.37 | 1.22 | 3.09 | infas360 | 12/2018 |
Average rent | €/sqm/month, no utilities | 6.63 | 6.25 | 1.29 | 4.13 | 11.00 | infas360 | 12/2018 |
Private cars per capita | Count, approximated from street block | 0.51 | 0.51 | 0.12 | 0.13 | 0.88 | infas360 | 2018/19 |
Distance to the nearest urban centre | km, radial | 3.64 | 2.62 | 2.97 | 0.27 | 12.35 | Regional plans | 12/2018 |
Public transport index | [no unit] | 4.59 | 4.52 | 2.02 | 0.00 | 11.12 | Own calculation, see method section | 2018/19 |
Travel time to the next destination | 10 minutes, weighted across 15 categories similar to fare accessibility, see method section | 2.08 | 1.86 | 1.01 | 0.32 | 5.10 | HVV API (“GEOFOX”) | 12/2018 |
Population density | 100 persons/ha | 0.35 | 0.22 | 0.32 | 0.02 | 1.41 | infas360 | 12/2018 |
Average dwelling area per capita | 10sqm | 5.71 | 5.80 | 1.03 | 2.50 | 9.00 | infas360 | 12/2018 |
The plot below shows corrected data for urban areas (as per the definition in the method section of my PhD thesis).
The plot below shows corrected data for rural areas.
This report presents four models at the stop level: a Spatial Durbin Error Model (SDEM), a Spatial Error Model (SEM), a Spatially Lagged X model (SLX), and Ordinary Least Squares (OLS).
SDEM and SLX include spatially lagged independent variables (the lag.X terms), while SEM and OLS do not; their lag.X rows in the table are therefore empty. Please note: for this table, fare accessibility was multiplied by 10 in order to obtain readable coefficients. I give an example in the Interpretation section below the table.
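As a minimal sketch, the four models can be fitted with the spatialreg package roughly as follows (assuming a formula object fml plus the stops data and the weights object nb used elsewhere in this report; the actual models are fitted in the upstream script 1_make_models.R, whose settings may differ):
# OLS: no spatial terms
ols <- lm(fml, data = stops)
# SLX: adds spatially lagged independent variables (the lag.X terms)
slx <- spatialreg::lmSLX(fml, data = stops, listw = nb)
# SEM: spatially autocorrelated error term, no lag.X terms
sem <- spatialreg::errorsarlm(fml, data = stops, listw = nb)
# SDEM: spatially autocorrelated errors plus lag.X terms
sdem <- spatialreg::errorsarlm(fml, data = stops, listw = nb, etype = "emixed")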
# #the parameter "results ='asis'" is crucial. Otherwise, the output would be raw html code
##### Calculate Pseudo-R^2^ for SDEM and SEM to list them in the table
# Pseudo-R^2 = 1 - SSE / TSS, where TSS = var(y) * (n - 1)
# Note that the formula points to the dynamic r variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#SDEM
psr2_sdem <- 1 - (sdem$SSE / (var(stops[[regr]]) * (length(stops[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem, 2)
##### SEM
psr2_sem <- 1 - (sem$SSE / (var(stops[[regr]]) * (length(stops[[regr]]) - 1)))
psr2_sem <- round(psr2_sem, 2)
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(sdem, sem, slx, ols),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '') # this one inserts two values and two empty strings into the table
),
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Multivariate regression estimates of the four models<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Align all columns to the right
# Print or use the modified HTML table
print(model_table_styled)
(Table: multivariate regression estimates of the four models; rendered at knit time, see caption in the code above.)
The statistical effects of each model are interpreted below using purchasing power as an example. The other independent variables can be interpreted in the same way. The following levels of significance are used:
Symbol | p |
---|---|
* | < 0.05 |
** | < 0.01 |
*** | < 0.001 |
# #download and launch plugin to insert a table easily (available when using the "Visual" mode on the upper left)
# install.packages("devtools")
# devtools::install_github("lbusett/insert_table")
# Extract the p-value and beta coefficient for the variable of interest
# helper: round betas with |value| < 1 to 2 decimal places (e.g., 0.06), otherwise to the nearest integer
round_beta <- function(x) {
  if (abs(x) < 1) round(x, 2) else round(x, 0)
}
#SDEM
sdem_p_kk <- coef(summary(sdem))["purch_power", "Pr(>|z|)"]
sdem_p_kk_lag <- coef(summary(sdem))["lag.purch_power", "Pr(>|z|)"]
# extract and round beta
sdem_kk <- round(sdem[["coefficients"]][["purch_power"]], 2)
sdem_kk_r <- round_beta(sdem_kk)
# extract and round beta (lag)
sdem_kk_lag <- round(sdem[["coefficients"]][["lag.purch_power"]], 2)
sdem_kk_lag_r <- round_beta(sdem_kk_lag)
#SEM
sem_p_kk <- coef(summary(sem))["purch_power", "Pr(>|z|)"]
# extract and round beta
sem_kk <- round(sem[["coefficients"]][["purch_power"]], 2)
sem_kk_r <- round_beta(sem_kk)
#SLX
slx_p_kk <- coef(summary(slx))["purch_power", "Pr(>|t|)"]
# extract and round beta
slx_kk <- round(slx[["coefficients"]][["purch_power"]], 2)
slx_kk_r <- round_beta(slx_kk)
#OLS
ols_summ <- summary(ols)
ols_p_kk <- round(ols_summ$coefficients["purch_power", "Pr(>|t|)"], 2)
# extract and round beta
ols_kk <- round(ols[["coefficients"]][["purch_power"]], 2)
ols_kk_r <- round_beta(ols_kk)
# note that I can weave these p values into the report text:
# "(p = `r sdem_p_kk`)"
As the summary table shows, purchasing power (kk) exerts an effect of 1.5 on fare accessibility (fare_acc). This means that for every €1,000 increase in purchasing power at a given stop, we should expect an increase of approximately 0.2 fare accessibility units (*fare accessibility is scaled by a factor of 10 for readability of the coefficients). The regression estimate is significant at the *** level.
The SDEM also estimates spillover effects: purchasing power (kk) exerts an indirect effect of 2.29 on fare accessibility (fare_acc) at neighbouring stops. This means that for every €1,000 increase in purchasing power at a given stop, we should expect an increase of approximately 0.2 fare accessibility units at any neighbouring stop.
As the summary table shows, purchasing power (kk) exerts an effect of 1.28 on fare accessibility (fare_acc). This means that for every €1,000 increase in purchasing power at a given stop, we should expect an increase of approximately 0.1 fare accessibility units. The regression estimate is significant at the *** level.
As the summary table shows, purchasing power (kk) exerts an effect of 1.9 on fare accessibility (fare_acc). This means that for every €1,000 increase in purchasing power at a given stop, we should expect an increase of approximately 0.2 fare accessibility units. The regression estimate is significant at the ** level.
As the summary table shows, purchasing power (kk) exerts an effect of 3.81 on fare accessibility (fare_acc). This means that for every €1,000 increase in purchasing power at a given stop, we should expect an increase of approximately 0.4 fare accessibility units. The regression estimate is significant at the *** level.
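For transparency, a short sketch of the arithmetic behind these readings, using the first coefficient as an example (the units follow the input variable table above):
# purchasing power is measured in units of €10,000/year (see the input variable table),
# and fare accessibility in the summary table is scaled by a factor of 10
beta  <- 1.5   # coefficient of purch_power from the summary table
delta <- 0.1   # a €1,000 increase equals 0.1 units of purch_power
beta * delta   # = 0.15, i.e. roughly 0.2 units of the scaled fare accessibility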
The table below shows the same models with normalized input variables. Each variable is standardized to a mean of 0 and a standard deviation of 1 (z-transformation; suffix _zt). This makes the beta coefficients comparable with each other even though the variables have different units. The interpretation is analogous to the example above, but the unit is standard deviations.
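As a minimal sketch, assuming the _zt columns are derived from the raw columns with a plain z-transformation (the actual transformation happens upstream):
# z-transform: subtract the mean, divide by the standard deviation
stops$purch_power_zt <- as.numeric(scale(stops$purch_power))
stops$fare_acc_zt    <- as.numeric(scale(stops$fare_acc))
# a standardized beta can be converted back to original units via
# beta_raw = beta_zt * sd(dependent variable) / sd(independent variable)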
# #the parameter "results ='asis'" is crucial. Otherwise, the output would be raw html code
##### Calculate Pseudo-R^2^ for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic r variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
#SDEM
psr2_sdem <- 1 - (sdem_zt$SSE / (var(stops[[regr]]) * (length(stops[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (sem_zt$SSE / (var(stops[[regr]]) * (length(stops[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(sdem_zt, sem_zt, slx_zt, ols_zt),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '') # this one inserts two values and two empty strings into the table
),
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Multivariate regression estimates of the four models, normalized values<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Align all columns to the right
# Print or use the modified HTML table
print(model_table_styled)
(Table: multivariate regression estimates with normalized input variables; rendered at knit time.)
The Moran’s I test indicates whether it is useful to apply spatial models. It tests against the following null hypothesis: there is no global spatial autocorrelation in the residuals. If p ≤ .05, we reject the null hypothesis, i.e., we observe global spatial autocorrelation: contiguous features tend to have similar values.
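For reference, the global Moran’s I on the OLS residuals is defined as

$$I = \frac{n}{S_0} \cdot \frac{\sum_i \sum_j w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2}, \qquad S_0 = \sum_i \sum_j w_{ij},$$

where $e_i$ are the residuals and $w_{ij}$ the spatial weights.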
moran_regr <- lm.morantest(ols, nb, zero.policy=T)
if(moran_regr$p.value <= .05) {
moran_regr$autoc <- T
} else {
moran_regr$autoc <- F
}
#transpose the Moran values into a dataframe
moran_out_num <- t(broom::tidy(moran_regr))
# assign column name (that's important to coerce/cast the data type later)
colnames(moran_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"estimate1" = "Observed I",
"estimate2" = "Expectation",
"estimate3" = "Variance",
"statistic" = "Statistic",
"p.value" = "p Value",
"method" = "Method",
"alternative" = "Alternative")
# Apply replacements for row names
row.names(moran_out_num) <- replacement_patterns[row.names(moran_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method", "Alternative")
#write character rows into new matrix
moran_out_chr <- as.data.frame(moran_out_num[(rownames(moran_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
moran_out_num <- as.data.frame(moran_out_num[!(rownames(moran_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
moran_out_num$value <- round(as.numeric(moran_out_num$value),3)
# call the moran results:
# meta information in character rows
DT::datatable(
moran_out_chr,
options = list(
caption = "Moran's I parameters",
autoWidth = FALSE,
searching = FALSE,
bPaginate = FALSE,
bLengthChange = FALSE,
bInfo = FALSE,
columnDefs = list(
list(targets = 1, # target only the 2nd column (the count seems to start from 0)
className = "dt-right") # Set text alignment to right
)
)
)
# call the moran results:
# estimates in numeric rows
DT::datatable(
moran_out_num,
options = list(caption = "Moran's I estimates",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. Thus, we observe spatial autocorrelation within the OLS residuals and should use spatial models.
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POLYGON
# load spatial data (important for Moran's I and R^2)
polygons <- readRDS(paste0(ur_s, "_ausw", vers, ".rds"))
con <- readRDS(paste0(ur_s, "_ausw", vers, "_con.rds"))
nb <- readRDS(paste0(ur_s, "_ausw", vers, "_nb.rds"))
#urban areas
polygons_u <- readRDS(paste0(ur_s, "_ausw", vers, "_u.rds"))
con_u <- readRDS(paste0(ur_s, "_ausw", vers, "_con_u.rds"))
nb_u <- readRDS(paste0(ur_s, "_ausw", vers, "_nb_u.rds"))
#rural areas
polygons_r <- readRDS(paste0(ur_s, "_ausw", vers, "_r.rds"))
con_r <- readRDS(paste0(ur_s, "_ausw", vers, "_con_r.rds"))
nb_r <- readRDS(paste0(ur_s, "_ausw", vers, "_nb_r.rds"))
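# NOTE: the nb objects loaded above are spatial weights (listw) created in an upstream script.
# As a sketch, such a row-standardized queen-contiguity weights object could be built with spdep
# (the actual criterion and parameters used upstream may differ):
# nb_sketch <- spdep::poly2nb(polygons, queen = TRUE)                      # contiguity neighbours
# nb        <- spdep::nb2listw(nb_sketch, style = "W", zero.policy = TRUE) # row-standardized listw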
# prepare data:
#there's one polygon that has no fare accessibility value (02000000-96011, HH-Eissendorf). Assign the average value to it to enable lagged values in the next step
polygons$fare_acc[is.na(polygons$fare_acc)] <- mean(polygons$fare_acc, na.rm = TRUE)
# calculate lagged values
polygons$fare_acc.lag <- lag.listw(nb, na.omit(polygons$fare_acc))
col1 <- "#178072"
p_var <- "fare_acc"
p_var_en <- "fare accessibility up to €2.30"
p_xlab <- paste0(p_var_en, " (", p_var, ")")
p_ylab <- paste0("lagged ", p_var_en, " (", p_var, ".lag)")
# Moran scatterplot: fare_acc vs. its spatial lag, with hover labels for ggplotly
p <- ggplot(polygons, aes(x = fare_acc, y = fare_acc.lag)) +
  geom_point(color = "black", alpha = .5,
             aes(text = sprintf("%s<br><b>%s</b><br>fare_acc: \t\t%.2f\nfare_acc.lag: \t%.2fe-6",
                                ags, name, fare_acc, fare_acc.lag))) +
  geom_smooth(method = "lm", se = FALSE, color = col1) +
  labs(x = p_xlab, y = p_ylab)
ggplotly(p, tooltip = "text")
# # spdep's out-of-the-box plot would be this one:
# spdep::moran.plot(polygons$kk, nb, zero.policy=attr(nb, "zero.policy"), spChk=NULL, labels=F,
# xlab=NULL, ylab=NULL, quiet=NULL, plot=TRUE, return_df=F)
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POINT
# prepare data:
# calculate lagged values
stops$fare_acc.lag <- lag.listw(nb, na.omit(stops$fare_acc))
# if on raster level: rename the first column to h_id_varchar and add an empty name column
if (lvl %in% c("z500", "z250", "z100")) {
names(stops)[1] <- 'h_id_varchar'
names(stops_u)[1] <- 'h_id_varchar'
names(stops_r)[1] <- 'h_id_varchar'
stops$h_name <- ""
stops_u$h_name <- ""
stops_r$h_name <- ""
# if we're analysing vbb stops that have gtfs as a PK: rename PK column
} else if (ur_s == "vbb" & lvl == "hst"){
names(stops)[1] <- 'h_id_varchar'
names(stops_u)[1] <- 'h_id_varchar'
names(stops_r)[1] <- 'h_id_varchar'
}
col1 <- "#178072"
p_var <- "fare_acc"
p_var_en <- "fare accessibility up to €2.30"
p_xlab <- paste0(p_var_en, " (", p_var, ")")
p_ylab <- paste0("lagged ", p_var_en, " (", p_var, ".lag)")
# Moran scatterplot: fare_acc vs. its spatial lag, with hover labels for ggplotly
p <- ggplot(stops, aes(x = fare_acc, y = fare_acc.lag)) +
  geom_point(color = "black", alpha = .5,
             aes(text = sprintf("%s<br><b>%s</b><br>fare_acc: \t\t%.2f\nfare_acc.lag: \t%.2fe-6",
                                h_id_varchar, h_name, fare_acc, fare_acc.lag))) +
  geom_smooth(method = "lm", se = FALSE, color = col1) +
  labs(x = p_xlab, y = p_ylab)
ggplotly(p, tooltip = "text")
# # spdep's out-of-the-box plot would be this one:
# spdep::moran.plot(stops$kk, nb, zero.policy=attr(nb, "zero.policy"), spChk=NULL, labels=F,
# xlab=NULL, ylab=NULL, quiet=NULL, plot=TRUE, return_df=F)
# define text chunks that explain LISA (or LISA's absence), which will be called depending on the variable lisa_draw that defines if a plot shall be drawn, depending on the spatial level.
lisa_T <- paste0("The LISA test (local indicators of spatial association) measures
autocorrelation at the local level, i.e. for each area compared to
its immediate neighbours. Again, it is the level of statistical significance that is relevant. LISA can be
mapped to explore spatial patterns, as the following plot shows for the distribution of ", dv_en, ".
The plot shows four different LISA patterns by comparing each area with its neighbours. Note that LISA shows significant patterns of clustering. However, it does not allow any interpretation of the values themselves -- or, as Luc Anselin puts it, ['[LISA] only suggests interesting locations ... it does not prove anything!'](https://youtu.be/xMzK1iNgDFE?si=OrMu_kx-VYCbZC05&t=1305)
###### The plot below shows data for ", formatC(nobs(ols), format = "d", big.mark = ","), " ", lvl_en_p, " (no outliers).")
lisa_F <- paste0("A LISA plot for the ", lvl, " level would require considerable computing capacity. As it is based on a nearest-neighbour analysis, it would deliver a picture quite similar to those of the other spatial levels. Therefore, this report does not include a LISA plot. Please refer to the reports for the ", lvl_oth, " levels, which can be downloaded via [DOI:10.15480/882.13162](https://doi.org/10.15480/882.13162).")
The LISA test (local indicators of spatial association) measures autocorrelation at the local level, i.e. for each area compared to its immediate neighbours. Again, it is the level of statistical significance that is relevant. LISA can be mapped to explore spatial patterns, as the following plot shows for the distribution of fare accessibility. The plot shows four different LISA patterns by comparing each area with its neighbours. Note that LISA shows significant patterns of clustering. However, it does not allow any interpretation of the values themselves – or, as Luc Anselin puts it, ‘[LISA] only suggests interesting locations … it does not prove anything!’
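For reference, the local Moran statistic compares each feature’s deviation from the mean with the weighted deviations of its neighbours:

$$I_i = \frac{x_i - \bar{x}}{m_2} \sum_j w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$$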
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POLYGON
# Compute LISA plot
# Inspired by https://stackoverflow.com/a/70387888
# Calculate a separate nb matrix for local LISA test
# Note that I assign a 30m buffer to heal the boundary effect between metropolis and outskirts
if (!exists("nb_loc")) {
# If it doesn't exist, create nb_loc
nb_loc <- rgeoda::queen_weights(st_buffer(polygons, 30))
}
# Calculate local Moran's I
moran_loc <- rgeoda::local_moran(nb_loc, st_drop_geometry(polygons["fare_acc"]))
moran_loc_lbls <- lisa_labels(moran_loc)
moran_loc_colors <- setNames(lisa_colors(moran_loc), moran_loc_lbls)
# Define my own colors instead of the standard ones (i.e., replace them in the respective array)
moran_loc_colors <- dplyr::recode(
moran_loc_colors,
"#eeeeee" = "#eeeeee", # not significant
"#FF0000" = "#0f4f4c", # high-high
"#f4ada8" = "#178072", # high-low
"#0000FF" = "#2DC6D6", # low-low
"#a7adf9" = "#C1EEF3", # low-high
"#464646" = "#464646", # undefined
"#999999" = "#999999" # isolated
)
# Define clusters
moran_clustered <- polygons %>%
st_drop_geometry() %>%
select(ags) %>%
mutate(cluster_num = lisa_clusters(moran_loc) + 1, # Add 1 because clusters are zero-indexed
cluster = factor(moran_loc_lbls[cluster_num], levels = moran_loc_lbls)) %>%
right_join(polygons, by = "ags") %>%
st_as_sf()
# # Map the clusters as a static ggplot
# ggplot(moran_clustered, aes(fill = cluster)) +
# geom_sf(color = "white", size = 0) +
# scale_fill_manual(name = "New Legend Title", values = moran_loc_colors, na.value = "grey90") +
# theme_minimal() +
# theme(panel.grid = element_blank(),
# axis.text = element_blank(),
# axis.title = element_blank(),
# panel.border = element_blank(),
# panel.background = element_blank())
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POLYGON
# Define color palette
pal <- colorFactor(moran_loc_colors, domain = moran_clustered$cluster)
# Define LISA label for each location, based on numeric cluster field
moran_clustered <- mutate(moran_clustered, cluster_label = case_when(
cluster_num == 1 ~ "No significant clustering",
cluster_num == 2 ~ "<strong>High fare_acc</strong> in this area, <strong>high fare_acc</strong> in the neighbouring areas",
cluster_num == 3 ~ "<strong>Low fare_acc</strong> in this area, <strong>low fare_acc</strong> in the neighbouring areas",
cluster_num == 4 ~ "<strong>Low fare_acc</strong> in this area, <strong>high fare_acc</strong> in the neighbouring areas",
cluster_num == 5 ~ "<strong>High fare_acc</strong> in this area, <strong>low fare_acc</strong> in the neighbouring areas",
cluster_num == 7 ~ "<strong>Isolated:</strong> This area has no neighbours",
TRUE ~ as.character(cluster_num)
))
moran_clustered$label <- paste0(
"<strong>", moran_clustered$ags, "<br/>",
moran_clustered$name, "</strong><br/>",
moran_clustered$cluster_label
) %>% lapply(htmltools::HTML)
# Render map
leaflet(moran_clustered) %>%
addProviderTiles('CartoDB.Positron', group = "Basemap") %>%
addPolygons(stroke = TRUE, smoothFactor = 0,
fillColor = ~pal(cluster),
color = "grey", # Thin grey outline for better visibility
weight = 0.5, # Outline thickness
fillOpacity = 0.7, # Transparency of the polygons
label = ~label,
highlight = highlightOptions(
color = "white",
bringToFront = TRUE
),
layerId = ~ags,
group = "LISA Clusters"
) %>%
addLegend(position = 'bottomright', pal = pal,
values = moran_clustered$cluster, title = 'LISA Clusters') %>%
addFullscreenControl() %>%
addLayersControl(
baseGroups = c("Basemap", "No Basemap"),
overlayGroups = c("LISA Clusters"),
options = layersControlOptions(collapsed = FALSE)
) %>%
hideGroup("No Basemap") # Start with the basemap visible
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POINT and the level is not raster 500m. The raster LISA plot would require too much computing capacity for only little gain in result.
# Compute LISA plot
# set CRS
#stops <- st_transform(stops, 4326)
st_crs(stops) <- 3857
# for testing: Create a subset of ~100 stops
#stops <- subset(stops, km_zentrum<=50)
#st_crs(stops) # check CRS
#st_crs(moran_clustered) # check CRS
# Create the matrix of local weights for the stops
# (created unconditionally here, since an nb_loc object may already exist from the polygon-level chunk)
nb_loc <- rgeoda::queen_weights(stops)
# # Drop geometry to pass only numeric values to local_moran
stops_nogeom <- st_drop_geometry(stops)
# Calculate local Moran's I (fare_acc is the variable I want to investigate)
moran_loc <- rgeoda::local_moran(nb_loc, stops_nogeom["fare_acc"])
moran_loc_lbls <- lisa_labels(moran_loc)
moran_loc_colors <- setNames(lisa_colors(moran_loc), moran_loc_lbls)
# Define my own colors instead of the standard ones (i.e., replace them in the respective array)
moran_loc_colors <- dplyr::recode(
moran_loc_colors,
"#eeeeee" = "#eeeeee", # not significant
"#FF0000" = "#0f4f4c", # high-high
"#f4ada8" = "#178072", # high-low
"#0000FF" = "#2DC6D6", # low-low
"#a7adf9" = "#C1EEF3", # low-high
"#464646" = "#464646", # undefined
"#999999" = "#999999" # isolated
)
# Define clusters
moran_clustered <- stops %>%
mutate(cluster_num = lisa_clusters(moran_loc) + 1, # add 1 bc clusters are zero-indexed
cluster = factor(moran_loc_lbls[cluster_num], levels = moran_loc_lbls),
h_id_varchar = as.character(h_id_varchar)) # Ensure h_id_varchar is character type
# the eval clause in the header defines that the chunk will only be executed when the geometry type is POINT
# define color palette
pal <- colorFactor(moran_loc_colors, domain = moran_clustered$cluster)
# define LISA label for each location, based on numeric cluster field
# note that ggplotly only allows the following HTML tags: <br>, <sub>, <sup>, <b>, <i>, <em>
# (https://stackoverflow.com/a/57812154/15097106)
moran_clustered <- mutate(moran_clustered, cluster_label = case_when(
cluster_num == 1 ~ "No significant clustering",
cluster_num == 2 ~ "<b>High fare_acc</b> at this stop,<br><b>high fare_acc</b> at the neighbouring stops",
cluster_num == 3 ~ "<b>Low fare_acc</b> at this stop,<br><b>low fare_acc</b> at the neighbouring stops",
cluster_num == 4 ~ "<b>Low fare_acc</b> at this stop,<br><b>high fare_acc</b> at the neighbouring stops",
cluster_num == 5 ~ "<b>High fare_acc</b> at this stop,<br><b>low fare_acc</b> at the neighbouring stops",
cluster_num == 7 ~ "<b>Isolated:</b> This stop has no neighbours",
TRUE ~ as.character(cluster_num) # Default label for other cases
))
# note that ggplotly only allows the following HTML tags: <br>, <sub>, <sup>, <b>, <i>, <em>
# (https://stackoverflow.com/a/57812154/15097106)
# Construct labels
moran_clustered$label <- paste0(
"<b>",
moran_clustered$h_name, "</b><br>",
moran_clustered$h_id_varchar, "<br>",
moran_clustered$cluster_label
)
#ensure the correct EPSG
moran_clustered <- st_transform(moran_clustered, crs = st_crs(3857))
moran_clustered <- st_set_crs(moran_clustered, 3857)
# draw a ggplot with labels
# there were serious issues with the CRS which cost me 2 days of work.
# In the end, I decided not to reproject the layers but create a new joined layer of the stops geometry and selected attributes of the moran_clustered sf, and draw an interactive ggplot instead of a leaflet map.
stops_joined <- stops %>%
left_join(moran_clustered %>% st_drop_geometry() %>% select(h_id_varchar, cluster, label),
by = "h_id_varchar")
# Ensure the 'label' column is a character vector
stops_joined$label <- as.character(stops_joined$label)
# Debug: Check for non-finite coordinates
# non_finite_coords <- sum(!is.finite(st_coordinates(stops_joined)))
# print(paste("Number of non-finite coordinates:", non_finite_coords))
# Create the ggplot object
p <- ggplot(data = stops_joined) +
geom_sf(aes(color = cluster, text = label), size = 1) +
scale_color_manual(values = moran_loc_colors) +
theme_void() + # Remove all axes and grids
theme(
plot.background = element_rect(fill = "grey10"), # Set background to grey10
panel.background = element_rect(fill = "grey0"), # Set panel background to grey0
plot.title = element_text(color = "white", size = 14, face = "bold"), # Title font to white
legend.title = element_text(color = "white"), # Legend title font to white
legend.text = element_text(color = "white") # Legend text font to white
) +
labs(
title = "LISA plot for fare accessibility",
color = "Cluster"
)
#p
# not functional:
# Convert to an interactive plotly plot
#
#p_interactive <- ggplotly(p, tooltip = "text")
p_interactive <- ggplotly(p)
# Ensure fullscreen button and mode bar are enabled
p_interactive <- p_interactive %>%
config(
displayModeBar = TRUE, # Ensure mode bar is displayed
displaylogo = FALSE, # Optionally, hide Plotly logo
modeBarButtonsToAdd = c('fullscreen') # Add fullscreen button to mode bar
)
# Render the interactive plot
p_interactive
It may be advisable to simplify the statistical analysis by using a more basic model. The Likelihood Ratio (LR) test indicates whether the SDEM can be restricted to a simpler spatial model or even to a non-spatial model, as explained in Burkey’s 2nd video (minute 21). The LR statistic is $2(\ln L_{\text{full}} - \ln L_{\text{restricted}})$ and follows a χ² distribution with degrees of freedom equal to the number of restricted parameters.
Firstly, we execute an LR test against the null hypothesis: we should restrict the model to SEM. If p > .05, we can’t reject H0 and should restrict the model to SEM.
restr2sem <- spatialreg::LR.Sarlm(sdem, sem )
if(restr2sem$p.value >= .05) {
restr2sem$restrict <- T
} else {
restr2sem$restrict <- F
}
# restr2sem
#transpose the LR values into a dataframe
lr_out_num <- t(broom::tidy(restr2sem))
# assign column name (that's important to coerce/cast the data type later)
colnames(lr_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"estimate1" = "Log likelihood of SDEM",
"estimate2" = "Log likelihood of SEM",
"statistic" = "Likelihood ratio",
"p.value" = "p Value",
"parameter" = "Parameter",
"method" = "Method")
# Apply replacements for row names
row.names(lr_out_num) <- replacement_patterns[row.names(lr_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method")
#write character rows into new matrix
lr_out_chr <- as.data.frame(lr_out_num[(rownames(lr_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
lr_out_num <- as.data.frame(lr_out_num[!(rownames(lr_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
lr_out_num$value <- round(as.numeric(lr_out_num$value),3)
# call the results:
# meta information in character rows
DT::datatable(
lr_out_chr,
options = list(
caption = "Likelihood Ratio parameters for SDEM → SEM",
autoWidth = FALSE,
searching = FALSE,
bPaginate = FALSE,
bLengthChange = FALSE,
bInfo = FALSE,
columnDefs = list(
list(targets = 1, # target only the 2nd column (the count seems to start from 0)
className = "dt-right") # Set text alignment to right
)
)
)
# call the LR results:
# estimates in numeric rows
DT::datatable(
lr_out_num,
options = list(caption = "Likelihood Ratio estimates for SDEM → SEM",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. Thus, it is not advised to restrict the model to SEM.
Secondly, we execute an LR test against the null hypothesis: we should restrict the model to SLX (similar to the H0 above). If p > .05, we can’t reject H0 and should restrict the model to SLX. If p ≤ .05, we do not need to restrict the SDEM model (in this case, we don’t need to restrict to OLS either).
restr2slx <- spatialreg::LR.Sarlm(sdem, slx )
if(restr2slx$p.value >= .05) {
restr2slx$restrict <- T
} else {
restr2slx$restrict <- F
}
# restr2slx
#transpose the LR values into a dataframe
lr_out_num <- t(broom::tidy(restr2slx))
# assign column name (that's important to coerce/cast the data type later)
colnames(lr_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"estimate1" = "Log likelihood of SDEM",
"estimate2" = "Log likelihood of SLX",
"statistic" = "Likelihood ratio",
"p.value" = "p Value",
"parameter" = "Parameter",
"method" = "Method")
# Apply replacements for row names
row.names(lr_out_num) <- replacement_patterns[row.names(lr_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method")
#write character rows into new matrix
lr_out_chr <- as.data.frame(lr_out_num[(rownames(lr_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
lr_out_num <- as.data.frame(lr_out_num[!(rownames(lr_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
lr_out_num$value <- round(as.numeric(lr_out_num$value),3)
# call the results:
# meta information in character rows
DT::datatable(
lr_out_chr,
options = list(
caption = "Likelihood Ratio parameters for SDEM → SLX",
autoWidth = FALSE,
searching = FALSE,
bPaginate = FALSE,
bLengthChange = FALSE,
bInfo = FALSE,
columnDefs = list(
list(targets = 1, # target only the 2nd column (the count seems to start from 0)
className = "dt-right") # Set text alignment to right
)
)
)
# call the LR results:
# estimates in numeric rows
DT::datatable(
lr_out_num,
options = list(caption = "Likelihood Ratio estimates for SDEM → SLX",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. Thus, it is not advised to restrict the model to SLX.
Thirdly, we execute an LR test against the null hypothesis: we should restrict the model to OLS (similar to the H0 above). If p > .05, we can’t reject H0 and should restrict the model to OLS. If p ≤ .05, we do not need to restrict the SDEM model.
restr2ols <- spatialreg::LR.Sarlm(sdem, ols )
if(restr2ols$p.value >= .05) {
restr2ols$restrict <- T
} else {
restr2ols$restrict <- F
}
# restr2ols
#transpose the LR values into a dataframe
lr_out_num <- t(broom::tidy(restr2ols))
# assign column name (that's important to coerce/cast the data type later)
colnames(lr_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"estimate1" = "Log likelihood of SDEM",
"estimate2" = "Log likelihood of OLS",
"statistic" = "Likelihood ratio",
"p.value" = "p Value",
"parameter" = "Parameter",
"method" = "Method")
# Apply replacements for row names
row.names(lr_out_num) <- replacement_patterns[row.names(lr_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method")
#write character rows into new matrix
lr_out_chr <- as.data.frame(lr_out_num[(rownames(lr_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
lr_out_num <- as.data.frame(lr_out_num[!(rownames(lr_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
lr_out_num$value <- round(as.numeric(lr_out_num$value),3)
# call the results:
# meta information in character rows
DT::datatable(
lr_out_chr,
options = list(
caption = "Likelihood Ratio parameters for SDEM → OLS",
autoWidth = FALSE,
searching = FALSE,
bPaginate = FALSE,
bLengthChange = FALSE,
bInfo = FALSE,
columnDefs = list(
list(targets = 1, # target only the 2nd column (the count seems to start from 0)
className = "dt-right") # Set text alignment to right
)
)
)
# call the LR results:
# estimates in numeric rows
DT::datatable(
lr_out_num,
options = list(caption = "Likelihood Ratio estimates for SDEM → OLS",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. Thus, it is not advised to restrict the model to OLS.
The studentized Breusch-Pagan test measures heteroskedasticity, i.e. whether the variance of the residuals depends on the independent variables. It tests against the null hypothesis: there is no heteroskedasticity. If p ≤ .05, we can reject H0 and have to assume heteroskedasticity. This slightly affects the standard errors and p values in the regression, but not the coefficients. The effect on the p values could lead to incorrect hypothesis testing in the models above. If p > .05, we can assume that our regression model is not affected by heteroskedasticity.
bp_sdem <- spatialreg::bptest.Sarlm(sdem, studentize = TRUE)
if(bp_sdem$p.value < .05) {
bp_sdem$het <- T
} else {
bp_sdem$het <- F
}
# bp_sdem
#transpose the BP values into a dataframe
bp_out_num <- t(broom::tidy(bp_sdem))
# assign column name (that's important to coerce/cast the data type later)
colnames(bp_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"statistic" = "Breusch-Pagan Statistic",
"p.value" = "p Value",
"parameter" = "Degrees of Freedom",
"method" = "Method")
# Apply replacements for row names
row.names(bp_out_num) <- replacement_patterns[row.names(bp_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method")
#write character rows into new matrix
bp_out_chr <- as.data.frame(bp_out_num[(rownames(bp_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
bp_out_num <- as.data.frame(bp_out_num[!(rownames(bp_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
bp_out_num$value <- round(as.numeric(bp_out_num$value),3)
# call the results:
# estimates in numeric rows
DT::datatable(
bp_out_num,
options = list(caption = "Breusch-Pagan statistics for SDEM",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. Thus, we have to assume heteroskedasticity in the model.
The Spatial Hausman test compares the SEM to the OLS parameters. It tests against the null hypothesis: there is no difference between the SEM and OLS parameters. If p ≤ .05, we can reject H0 and have to assume that neither of the two models is a good fit to the data and/or that the models are not the optimal choice to capture spatial autocorrelation (Burkey 1, minute 33ff).
hausman_regr <- spatialreg::Hausman.test(sem)
if(hausman_regr$p.value <= .05) {
hausman_regr$significant <- T
} else {
hausman_regr$significant <- F
}
#transpose the Hausman values into a dataframe
hausman_out_num <- t(broom::tidy(hausman_regr))
# assign column name (that's important to coerce/cast the data type later)
colnames(hausman_out_num) <- c("value")
# Define Replacement patterns to specify the reported row names (estimate1-3 need to get proper labels)
replacement_patterns <- c(
"statistic" = "Spatial Hausman Statistic",
"p.value" = "p Value",
"parameter" = "Degrees of Freedom",
"method" = "Method")
# Apply replacements for row names
row.names(hausman_out_num) <- replacement_patterns[row.names(hausman_out_num)]
# Define the rows that contain strings (i.e., non-numeric values)
chr <- c("Method")
#write character rows into new matrix
hausman_out_chr <- as.data.frame(hausman_out_num[(rownames(hausman_out_num) %in% chr), , drop = FALSE])
#keep only numeric rows in numeric matrix
hausman_out_num <- as.data.frame(hausman_out_num[!(rownames(hausman_out_num) %in% chr), , drop = FALSE])
# coerce/cast data type to numeric, also round to 3 digits
hausman_out_num$value <- round(as.numeric(hausman_out_num$value),3)
# call the results:
# estimates in numeric rows
DT::datatable(
hausman_out_num,
options = list(caption = "Spatial Hausman statistics for SEM vs. OLS",
autoWidth = F,
searching = F, #disable search field
bPaginate = F, #hide pagination control
bLengthChange = F, #hide dropdown menu for list length
bInfo = F #hide information how many entries are shown
)) %>%
formatRound(1,digits = 3)
The p value is ≤ .05. This suggests that neither OLS nor SEM is appropriate to capture the effects in our spatial data, as the model parameters differ substantially.
In my doctoral thesis, I put forward 13 hypotheses concerning fare accessibility, purchasing power and residential density (see the literature section of my dissertation). Eleven of these hypotheses are tested at the stop level in this report, i.e. U1-U5, R1-R5, and G1. The remaining hypotheses G2 and G3 compare the goodness of fit of the models across the spatial levels, which is not possible at this level of analysis; they are therefore discussed in the results section of my thesis. All bivariate hypotheses are tested on normalized input variables (see explanation above).
#define the variables for this Pseudo-R^2 and for the caption
regr <- "pt_index_zt"
dv <- regr
dv_en <- oevx_en
iv <- "purch_power_zt"
iv_en <- kk_en
iv_unit_en <- kk_unit_en
hyp <- 1
hyp_sdem <- paste0("u", hyp, "_sdem_zt")
hyp_sem <- paste0("u", hyp, "_sem_zt")
hyp_slx <- paste0("u", hyp, "_slx_zt")
hyp_ols <- paste0("u", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic r variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_u)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Align all columns to the right
# Print or use the modified HTML table
print(model_table_styled)
(Table: bivariate regression estimates of the four models; rendered at knit time.)
Hypothesis U1 states that low purchasing power predicts a high level of public transport service. As the beta coefficient of the bivariate regression shows, purchasing power (purch_power_zt) exerts an effect of -0.1 on the public transport index (pt_index_zt). This means that for an increase of 1 standard deviation of purch_power_zt at a given stop, we should expect a decrease of 0.1 standard deviations of pt_index_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is significant at the *** level.
Hence, hypothesis U1 is supported.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "purch_power_zt"
dv <- regr
dv_en <- kk_en # English label for purch_power
iv <- "pop_dens_zt"
iv_en <- pop_ha_en
iv_unit_en <- pop_ha_unit_en
hyp <- 2
hyp_sdem <- paste0("u", hyp, "_sdem_zt")
hyp_sem <- paste0("u", hyp, "_sem_zt")
hyp_slx <- paste0("u", hyp, "_slx_zt")
hyp_ols <- paste0("u", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic r variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_u)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Align all columns to the right
# Print or use the modified HTML table
print(model_table_styled)
(Table: bivariate regression estimates of the four models; rendered at knit time.)
Hypothesis U2 states that high residential density predicts low purchasing power. As the beta coefficient of the bivariate regression shows, residential density (pop_dens_zt) exerts an effect of -0.33 on purchasing power (purch_power_zt). This means that for an increase of 1 standard deviation of pop_dens_zt at a given stop, we should expect a decrease of 0.33 standard deviations of purch_power_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is significant at the *** level.
Hence, hypothesis U2 is supported.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en # English label for fare_acc
iv <- "purch_power_zt"
iv_en <- kk_en
iv_unit_en <- kk_unit_en
hyp <- 3
hyp_sdem <- paste0("u", hyp, "_sdem_zt")
hyp_sem <- paste0("u", hyp, "_sem_zt")
hyp_slx <- paste0("u", hyp, "_slx_zt")
hyp_ols <- paste0("u", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic r variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_u)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: purch_power_zt; dependent variable: fare_acc_zt.]
Hypothesis U3 states that low purchasing power predicts high fare accessibility. As the beta coefficient of the bivariate regression shows, purchasing power (purch_power_zt) exerts an effect of -0.1 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of purch_power_zt at a given stop, we should expect a decrease of 0.1 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is significant (*).
Hence, hypothesis U3 is verified.
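Because all inputs are z-transformed, a standardized beta can be rescaled into original units when needed; a minimal sketch, assuming stops_u also carries the unstandardized columns fare_acc and purch_power (an assumption, as only the _zt columns appear in this report):
# beta_raw = beta_std * sd(y) / sd(x); here for U3's standardized beta of -0.1
beta_raw <- -0.1 * sd(stops_u$fare_acc) / sd(stops_u$purch_power)
# beta_raw is the expected change of fare_acc (in its original unit)
# per one original unit of purch_power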
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
iv <- "dist_centre_zt"
iv_en <- km_zentrum_en
iv_unit_en <- km_zentrum_unit_en
hyp <- 4
hyp_sdem <- paste0("u", hyp, "_sdem_zt")
hyp_sem <- paste0("u", hyp, "_sem_zt")
hyp_slx <- paste0("u", hyp, "_slx_zt")
hyp_ols <- paste0("u", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_u)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: dist_centre_zt; dependent variable: fare_acc_zt.]
Hypothesis U4 states that a short distance to the next centre (i.e., high spatial centrality) predicts high fare accessibility. As the beta coefficient of the bivariate regression shows, the distance to the next middle-level centre (dist_centre_zt) exerts an effect of -0.24 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of dist_centre_zt at a given stop, we should expect a decrease of 0.24 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is, however, not significant.
Hence, hypothesis U4 is falsified and needs to be dismissed.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
iv <- "cars_cap_zt"
iv_en <- pkw_en
iv_unit_en <- pkw_unit_en
hyp <- 5
hyp_sdem <- paste0("u", hyp, "_sdem_zt")
hyp_sem <- paste0("u", hyp, "_sem_zt")
hyp_slx <- paste0("u", hyp, "_slx_zt")
hyp_ols <- paste0("u", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_u)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: cars_cap_zt; dependent variable: fare_acc_zt.]
Hypothesis U5 states that high car availability predicts low fare accessibility. As the beta coefficient of the bivariate regression shows, car availability (cars_cap_zt) exerts an effect of -0.24 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of cars_cap_zt at a given stop, we should expect a decrease of 0.24 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is significant (***).
Hence, hypothesis U5 is verified.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "pt_index_zt"
dv <- regr
dv_en <- oevx_en
iv <- "purch_power_zt"
iv_en <- kk_en
iv_unit_en <- kk_unit_en
hyp <- 1
hyp_sdem <- paste0("r", hyp, "_sdem_zt")
hyp_sem <- paste0("r", hyp, "_sem_zt")
hyp_slx <- paste0("r", hyp, "_slx_zt")
hyp_ols <- paste0("r", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_r)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: purch_power_zt; dependent variable: pt_index_zt.]
Hypothesis R1 states that low purchasing power predicts low public transportation service. As the beta coefficient of the bivariate regression shows, purchasing power (purch_power_zt) exerts practically no effect (a rounded beta of 0) on the public transport index (pt_index_zt): an increase of 1 standard deviation of purch_power_zt at a given stop is associated with essentially no change in pt_index_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is positive but near zero. The regression estimate is, however, not significant.
Hence, hypothesis R1 is falsified and needs to be dismissed.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "purch_power_zt"
dv <- regr
dv_en <- kk_en
iv <- "pop_dens_zt"
iv_en <- pop_ha_en
iv_unit_en <- pop_ha_unit_en
hyp <- 2
hyp_sdem <- paste0("r", hyp, "_sdem_zt")
hyp_sem <- paste0("r", hyp, "_sem_zt")
hyp_slx <- paste0("r", hyp, "_slx_zt")
hyp_ols <- paste0("r", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_r)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: pop_dens_zt; dependent variable: purch_power_zt.]
Hypothesis R2 states that high residential density predicts high purchasing power. As the beta coefficient of the bivariate regression shows, residential density (pop_dens_zt) exerts an effect of -0.27 on purchasing power (purch_power_zt). This means that for an increase of 1 standard deviation of pop_dens_zt at a given stop, we should expect a decrease of 0.27 standard deviations of purch_power_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is significant (***).
Hence, hypothesis R2 is falsified and needs to be dismissed.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
iv <- "purch_power_zt"
iv_en <- kk_en
iv_unit_en <- kk_unit_en
hyp <- 3
hyp_sdem <- paste0("r", hyp, "_sdem_zt")
hyp_sem <- paste0("r", hyp, "_sem_zt")
hyp_slx <- paste0("r", hyp, "_slx_zt")
hyp_ols <- paste0("r", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_r)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: purch_power_zt; dependent variable: fare_acc_zt.]
Hypothesis R3 states that low purchasing power predicts low fare accessibility. As the beta coefficient of the bivariate regression shows, purchasing power (purch_power_zt) exerts an effect of 0.15 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of purch_power_zt at a given stop, we should expect an increase of 0.15 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is positive. The regression estimate is significant (***).
Hence, hypothesis R3 is verified.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
iv <- "dist_centre_zt"
iv_en <- km_zentrum_en
iv_unit_en <- km_zentrum_unit_en
hyp <- 4
hyp_sdem <- paste0("r", hyp, "_sdem_zt")
hyp_sem <- paste0("r", hyp, "_sem_zt")
hyp_slx <- paste0("r", hyp, "_slx_zt")
hyp_ols <- paste0("r", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_r)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: dist_centre_zt; dependent variable: fare_acc_zt.]
Hypothesis R4 states that a short distance to the next centre (i.e., high spatial centrality) predicts high fare accessibility. As the beta coefficient of the bivariate regression shows, the distance to the next middle-level centre (dist_centre_zt) exerts an effect of -0.02 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of dist_centre_zt at a given stop, we should expect a decrease of 0.02 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is negative. The regression estimate is, however, not significant.
Hence, hypothesis R4 is falsified and needs to be dismissed.
#define the variables for this Pseudo-R^2 and for the caption
regr <- "fare_acc_zt"
dv <- regr
dv_en <- t2_en
iv <- "cars_cap_zt"
iv_en <- pkw_en
iv_unit_en <- pkw_unit_en
hyp <- 5
hyp_sdem <- paste0("r", hyp, "_sdem_zt")
hyp_sem <- paste0("r", hyp, "_sem_zt")
hyp_slx <- paste0("r", hyp, "_slx_zt")
hyp_ols <- paste0("r", hyp, "_ols_zt")
##### Calculate Pseudo-R^2 for SDEM and SEM to list them in the table
# Note that the formula points to the dynamic regr variable, i.e., oevx/t2/whatever I assign as the dependent variable in the file 1_make_models.R!
#Furthermore, the get(...) calls dynamically point to the variables I have defined above, depending on my dissertation hypotheses
#SDEM
psr2_sdem <- 1 - (get(hyp_sdem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sdem <- round(psr2_sdem,2)
##### SEM
psr2_sem <- 1 - (get(hyp_sem)$SSE / (var(stops_u[[regr]]) * (length(stops_u[[regr]]) - 1)))
psr2_sem <- round(psr2_sem,2)
# extract p value for hyp testing
#hyp_p <- coef(summary(get(hyp_sdem)))[iv, "Pr(>|z|)"]
hyp_p <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == iv), "Pr(>|z|)"]
# Generate the htmlreg table with custom rows for Pseudo-R^2^
model_table <- texreg::htmlreg(
list(get(hyp_sdem), get(hyp_sem), get(hyp_slx), get(hyp_ols)),
doctype = FALSE,
custom.model.names = model_names,
custom.gof.rows = list('Pseudo-R^2^' = c(psr2_sdem, psr2_sem, '', '')), # this one inserts two values and two empty strings into the table
reorder.gof = c(2:10, 1, 11:19), # reorder the R^2^ rows, i.e., insert row number 1 in-between row 10 (last R^2^ of the regular models) and row 11.
caption = paste0("Bivariate regression estimates of the four models<br>
Independent variable: ", iv_en, " (", iv, ")",
"<br>
Dependent variable: ", dv_en, " (", dv, ")<br>
The sample is stripped of outliers [(see boxplot above)](#bplot_r)."),
custom.note = cap_sig)
# Create a kableExtra table and set styles for all columns
model_table_styled <- kable(
model_table,
format = "html",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>% # Set to TRUE if you want full-width
column_spec(1, extra_css = "text-align: right;") # Right-align the first column (variable names)
# Print or use the modified HTML table
print(model_table_styled)
[Table: Bivariate regression estimates of the four models (SDEM, SEM, SLX, OLS). Independent variable: cars_cap_zt; dependent variable: fare_acc_zt.]
Hypothesis R5 states that high car availability predicts low fare accessibility. As the beta coefficient of the bivariate regression shows, car availability (cars_cap_zt) exerts an effect of 0.1 on fare accessibility (fare_acc_zt). This means that for an increase of 1 standard deviation of cars_cap_zt at a given stop, we should expect an increase of 0.1 standard deviations of fare_acc_zt. As the variables are normalized, we must not interpret them in their original units.
The bivariate regression estimator is positive. The regression estimate is significant (***).
Hence, hypothesis R5 is falsified and needs to be dismissed.
# define text chunks that explain G1 significance depending on the p values in the normalized model
#define model
hyp_sdem <- "sdem_zt"
# extract p value for hyp testing
g1_p_pop_dens <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "pop_dens_zt"), "Pr(>|z|)"]
g1_p_travel_time <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "travel_time_zt"), "Pr(>|z|)"]
g1_p_rent <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "rent_zt"), "Pr(>|z|)"]
g1_p_purch_power <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "purch_power_zt"), "Pr(>|z|)"]
g1_p_cars_cap <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "cars_cap_zt"), "Pr(>|z|)"]
g1_p_dist_centre <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "dist_centre_zt"), "Pr(>|z|)"]
g1_p_pt_index <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "pt_index_zt"), "Pr(>|z|)"]
g1_p_dwelling_area <- coef(summary(get(hyp_sdem)))[which(rownames(coef(summary(get(hyp_sdem)))) == "dwelling_area_zt"), "Pr(>|z|)"]
# write beta estimators into new variables
g1_b_pop_dens <- round(get("sdem_zt")[["coefficients"]][["pop_dens_zt"]],2)
g1_b_travel_time <- round(get("sdem_zt")[["coefficients"]][["travel_time_zt"]],2)
g1_b_rent <- round(get("sdem_zt")[["coefficients"]][["rent_zt"]],2)
g1_b_purch_power <- round(get("sdem_zt")[["coefficients"]][["purch_power_zt"]],2)
g1_b_cars_cap <- round(get("sdem_zt")[["coefficients"]][["cars_cap_zt"]],2)
g1_b_dist_centre <- round(get("sdem_zt")[["coefficients"]][["dist_centre_zt"]],2)
g1_b_pt_index <- round(get("sdem_zt")[["coefficients"]][["pt_index_zt"]],2)
g1_b_dwelling_area <- round(get("sdem_zt")[["coefficients"]][["dwelling_area_zt"]],2)
# derive text chunk that can be inserted inline
if (g1_p_pop_dens < sig3b ){g1_pop_dens_sig <- sig3} else if (g1_p_pop_dens < sig2b ){g1_pop_dens_sig <- sig2} else if (g1_p_pop_dens < sig1b ){g1_pop_dens_sig <- sig1} else {g1_pop_dens_sig <- sig0}
if (g1_p_travel_time < sig3b ){g1_travel_time_sig <- sig3} else if (g1_p_travel_time < sig2b ){g1_travel_time_sig <- sig2} else if (g1_p_travel_time < sig1b ){g1_travel_time_sig <- sig1} else {g1_travel_time_sig <- sig0}
if (g1_p_rent < sig3b ){g1_rent_sig <- sig3} else if (g1_p_rent < sig2b ){g1_rent_sig <- sig2} else if (g1_p_rent < sig1b ){g1_rent_sig <- sig1} else {g1_rent_sig <- sig0}
if (g1_p_purch_power < sig3b ){g1_purch_power_sig <- sig3} else if (g1_p_purch_power < sig2b ){g1_purch_power_sig <- sig2} else if (g1_p_purch_power < sig1b ){g1_purch_power_sig <- sig1} else {g1_purch_power_sig <- sig0}
if (g1_p_cars_cap < sig3b ){g1_cars_cap_sig <- sig3} else if (g1_p_cars_cap < sig2b ){g1_cars_cap_sig <- sig2} else if (g1_p_cars_cap < sig1b ){g1_cars_cap_sig <- sig1} else {g1_cars_cap_sig <- sig0}
if (g1_p_dist_centre < sig3b ){g1_dist_centre_sig <- sig3} else if (g1_p_dist_centre < sig2b ){g1_dist_centre_sig <- sig2} else if (g1_p_dist_centre < sig1b ){g1_dist_centre_sig <- sig1} else {g1_dist_centre_sig <- sig0}
if (g1_p_pt_index < sig3b ){g1_pt_index_sig <- sig3} else if (g1_p_pt_index < sig2b ){g1_pt_index_sig <- sig2} else if (g1_p_pt_index < sig1b ){g1_pt_index_sig <- sig1} else {g1_pt_index_sig <- sig0}
if (g1_p_dwelling_area < sig3b ){g1_dwelling_area_sig <- sig3} else if (g1_p_dwelling_area < sig2b ){g1_dwelling_area_sig <- sig2} else if (g1_p_dwelling_area < sig1b ){g1_dwelling_area_sig <- sig1} else {g1_dwelling_area_sig <- sig0}
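# The eight chains above can be collapsed into one helper; a sketch that
# assumes the thresholds sig3b < sig2b < sig1b and the text chunks
# sig0..sig3 as defined earlier in this report:
sig_label <- function(p) {
  if (p < sig3b) sig3 else if (p < sig2b) sig2 else if (p < sig1b) sig1 else sig0
}
# example: g1_pop_dens_sig <- sig_label(g1_p_pop_dens)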
# find out the highest absolute beta
# (the case_when clauses exclude non-significant betas)
g1_b_max <- max(c(
abs(case_when(
g1_p_pop_dens < sig1b ~ g1_b_pop_dens,
TRUE ~ NA_real_ # Ensure NA_real_ is used for numeric NAs
)),
abs(case_when(
g1_p_travel_time < sig1b ~ g1_b_travel_time,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_rent < sig1b ~ g1_b_rent,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_purch_power < sig1b ~ g1_b_purch_power,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_cars_cap < sig1b ~ g1_b_cars_cap,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_dist_centre < sig1b ~ g1_b_dist_centre,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_pt_index < sig1b ~ g1_b_pt_index,
TRUE ~ NA_real_
)),
abs(case_when(
g1_p_dwelling_area < sig1b ~ g1_b_dwelling_area,
TRUE ~ NA_real_
))
), na.rm = TRUE) # Ensure NA values are removed when calculating the max
# determine if density has the highest absolute significant beta, i.e. if we can confirm hyp G1
# (abs() is needed because g1_b_max was computed over absolute values)
if (abs(g1_b_pop_dens) == g1_b_max) {
  g1_decision <- TRUE
} else {
  g1_decision <- FALSE
}
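The maximum/decision logic above can be written more compactly with named vectors, which also makes explicit which predictor carries the largest significant effect; a sketch assuming the g1_b_* and g1_p_* objects from the chunk above:
# collect betas and p values in named vectors (same variable order in both)
g1_b <- c(pop_dens = g1_b_pop_dens, travel_time = g1_b_travel_time,
          rent = g1_b_rent, purch_power = g1_b_purch_power,
          cars_cap = g1_b_cars_cap, dist_centre = g1_b_dist_centre,
          pt_index = g1_b_pt_index, dwelling_area = g1_b_dwelling_area)
g1_p <- c(pop_dens = g1_p_pop_dens, travel_time = g1_p_travel_time,
          rent = g1_p_rent, purch_power = g1_p_purch_power,
          cars_cap = g1_p_cars_cap, dist_centre = g1_p_dist_centre,
          pt_index = g1_p_pt_index, dwelling_area = g1_p_dwelling_area)
sig_abs <- abs(g1_b[g1_p < sig1b])        # absolute betas of the significant predictors
g1_decision <- names(which.max(sig_abs)) == "pop_dens"  # TRUE only if density has the largest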
Hypothesis G1 refers to the multivariate model that is estimated for normalized variables. G1 states that the absolute beta coefficient of density is greater than any other beta. G1 can be tested by comparing the coefficients in the summary table for normalized variables. These are:
- Density: 0.09, significant (***)
- Purchasing power: 0.09, significant (***)
- Rent: 0.06, significant (**)
- Car availability: 0.08, significant (***)
- Distance to centre: -0.04, not significant
- PT index: 0.02, not significant
- Travel time: -0.11, significant (***)
- Dwelling area: -0.04, significant (**)
Note that we’re comparing absolute effects; positive and negative beta values are treated equally.
The comparison shows that population density does not have the highest absolute significant impact; travel time (|-0.11|) does. Thus, hypothesis G1 has to be dismissed.
Hypotheses G2 and G3 compare the goodness-of-fit of the models across the spatial levels, which is not possible at this level of analysis. Therefore, G2 and G3 are discussed in the results chapter of my thesis.