For our tutorial on logistic regression, let's use a famous dataset called `wells` (Gelman & Hill, 2007), which comes from a survey of 3,020 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well nearby, and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,020 observations and the following variables:
- `switch`: binary/dummy (0 or 1) for well-switching.
- `arsenic`: arsenic level in the respondent's well.
- `dist`: distance (meters) from the respondent's house to the nearest well with safe drinking water.
- `assoc`: binary/dummy (0 or 1) indicating whether member(s) of the household participate in community organizations.
- `educ`: years of education (head of household).
using CSV
using DataFrames
using TuringGLM
url = "https://github.com/TuringLang/TuringGLM.jl/raw/main/data/wells.csv";
wells = CSV.read(download(url), DataFrame)
| Row | switch | arsenic | dist | assoc | educ |
|---|---|---|---|---|---|
| 1 | 1 | 2.36 | 16.826 | 0 | 0 |
| 2 | 1 | 0.71 | 47.322 | 0 | 0 |
| 3 | 0 | 2.07 | 20.967 | 0 | 10 |
| 4 | 1 | 1.15 | 21.486 | 0 | 12 |
| 5 | 1 | 1.1 | 40.874 | 1 | 14 |
| 6 | 1 | 3.9 | 69.518 | 1 | 9 |
| 7 | 1 | 2.97 | 80.711 | 1 | 4 |
| 8 | 1 | 3.24 | 55.146 | 0 | 10 |
| 9 | 1 | 3.28 | 52.647 | 1 | 0 |
| 10 | 1 | 2.52 | 75.072 | 1 | 0 |
| ... | | | | | |
| 3020 | 1 | 0.66 | 20.844 | 1 | 5 |
Using `switch` as the dependent variable and `dist`, `arsenic`, `assoc`, and `educ` as independent variables:
fm = @formula(switch ~ dist + arsenic + assoc + educ)
FormulaTerm
Response:
  switch(unknown)
Predictors:
  dist(unknown)
  arsenic(unknown)
  assoc(unknown)
  educ(unknown)
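Since this tutorial fits a logistic regression, it may help to read the formula as a statistical model. The following is a sketch of the likelihood only, assuming the usual implicit intercept; the symbols $\alpha$, $\beta$, $\eta_i$, and $p_i$ are notation introduced here for illustration, and the priors that TuringGLM places on the parameters are left out:

```latex
\begin{aligned}
\eta_i &= \alpha
  + \beta_{\text{dist}}\,\text{dist}_i
  + \beta_{\text{arsenic}}\,\text{arsenic}_i
  + \beta_{\text{assoc}}\,\text{assoc}_i
  + \beta_{\text{educ}}\,\text{educ}_i \\
p_i &= \operatorname{logit}^{-1}(\eta_i) = \frac{1}{1 + e^{-\eta_i}} \\
\text{switch}_i &\sim \operatorname{Bernoulli}(p_i)
\end{aligned}
```

Here $i$ indexes respondents, $\eta_i$ is the linear predictor on the log-odds scale, and $p_i$ is the probability that respondent $i$ switched wells.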
Now we instantiate our model with `turing_model`, passing the keyword argument `model=Bernoulli` to indicate that the model is a logistic regression:
model = turing_model(fm, wells; model=Bernoulli);
chn = sample(model, NUTS(), 2_000);
plot_chains(chn)
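The coefficients of a logistic regression live on the log-odds scale, so it can be useful to map them back to probabilities when reading the chains. The snippet below is a minimal sketch in plain Julia. The helper `invlogit` and the numeric values of `intercept` and `beta_arsenic` are hypothetical and chosen only for illustration; they are not estimates from the `wells` model.

```julia
# Inverse logit (logistic) function: maps a log-odds value to a probability in (0, 1).
invlogit(x) = 1 / (1 + exp(-x))

# Hypothetical values for illustration only; NOT estimates from the wells model.
intercept = 0.0      # hypothetical intercept (log-odds scale)
beta_arsenic = 0.5   # hypothetical coefficient for arsenic (log-odds per unit)

# Probability of switching at two arsenic levels, with all other predictors set to zero.
p_low = invlogit(intercept + beta_arsenic * 1.0)
p_high = invlogit(intercept + beta_arsenic * 2.0)

# Exponentiating a coefficient gives the odds ratio for a one-unit increase.
odds_ratio = exp(beta_arsenic)

println("P(switch | arsenic = 1) = ", round(p_low; digits=3))
println("P(switch | arsenic = 2) = ", round(p_high; digits=3))
println("Odds ratio per unit of arsenic = ", round(odds_ratio; digits=3))
```

With real results, the hypothetical numbers would be replaced by posterior summaries (for example, the posterior means or medians) taken from `chn`.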
References
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.