Will Procrastinators Support Environmental Tax?
Procrastination, in km could measure the distance between myself and success; in %, however, might explain why we still have no solution to Global Warming.
1. Introduction
Background
Carbon emissions are known to be the prime contributor to Global Warming. Yet, there are only 15% of the countries had adopted Carbon tax initiatives as of 2019. Among all the explanations, I am going to replicate my undergrad dissertation to prove people with present-bias (e.g. Procrastinators) are less likely to support environmental tax with Python and R in this blog post.
Time Discounting
To start with, we need to define what is carbon tax - by intentionally increasing the carbon cost, we discourage our short-term over-consumption for the long-term environmental benefit. And like an investment, time discounting plays a role when deciding whether to support a long-term tax policy. Essentially, we are putting a discount rate (%) on cash flows we will receive in the future. For example, I would prefer receiving $100 now than delaying it, as somehow, I feel like the same $100 would be of less value in a month, and it would further depreciate in a year. In other words, the more distant the environmental benefit, the less it meant to us at the present. The logic could be represented by Net Present Value (\(NPV\)), where \(t\) represents the period, \(T\) is the total number of periods, \(R_t\) is the cash flow at \(t\), and \(i\) is the interest rate: $$NPV(i,T) = \sum_{t=1}^T \frac{R_t}{(1+i)^t}$$ From the equation above, we can see two factors, time \(t\) and interest rate \(i\) that are negatively correlated with \(NPV\). In this case, we can also interpret \(i\) as the rate at which we discount values over time, or simply how impatient we are. As Amdur et al. (2015) proposed, “individuals’ discount rate would be negatively correlated with their support for environmental taxation”. Meaning people who are more impatient are less encouraged to support a long term policy like carbon tax. This is also going to be my Hypothesis 1 (H1), which is insignificantly supported in later sections, and will be seen as an evidence to employ time discounting to the subject matter.
Time Preference
While Amdur’s research was one of the first to empirically link the magnitude of time discounting with individuals’ support for environmental policy, their underlying assumption of dynamic consistency is debatable.
According to Samuelson's (1937) Discounted Utility Model, economists only assumed a constant discount rate \(i\) for analytical convenience. (For those who are interested: a constant discount rate means the marginal rate of substitution between two deferred goods separated by a fixed lag should be stable across time. Individuals with such preference would be described as having an exponential discounting function or a time consistent preference. (Laibson, 1996)) In Kirby's (1995, p.83) words:
If you would like a chocolate bar next Tuesday twice as much as an apple next Wednesday, you would like them in the same ratio for successive Tuesdays and Wednesdays a month, a year, or 10 years from now.Instead of chocolate bar and apples, I am going to compare the rate at which an individual would be indifferent between separated financial rewards in different timeframes with a titration approach.
Dynamic Inconsistency
Unlike the classical belief on time consistent preference adopted in Amdur’s research, recent studies have demonstrated dynamic inconsistency (Fedrick et al., 2002). This phenomena suggests some people might put higher weight on the present over future utility, (Present-Biased); or vice versa, (Future-Biased) (Goda et al., 2019).
For a present-bias example, you realised there was an approaching deadline on Day 1, and you thought you would start on the next day so you could enjoy the new Netflix Series for now. But on Day 2, you were impatient about the next episode and postponed your wish to work to Day 3... until the deadline. People with such procrastinating behaviour are explained in Behavioural Economics by how we would underestimate tomorrow's \((t+1)\) pleasure in today's \((t)\) point of view. Specifically, a present-biased individual would be increasingly impatient (i.e. associate higher discount rate) as an intertemporal choice approaches. From studies, such discounter would over-pursue immediate gratification, and be more likely to bear credit-card debt or have poor health status (O'Donoghue and Rabin, 2015; Meier and Sprenger, 2010; Ikeda et al., 2010).
Following this rationale, I propose individuals with such time preference would care more about the short-term sacrifices in paying additional tax, and weight less on the distant benefit of climate mitigation. Hence, Hypothesis 2 (H2) is that “individuals with present-bias would be less likely to support environmental taxation than individuals with consistent time preference”.
2. Data & Methodology
To prove the two hypotheses, I am using the household survey data from the UK Understanding Society Innovation Panel released by the Economic and social Data of the UK Data Service. (Full acknowledgement in my GitHub Repository) Every year since Wave 1 in 2008, same group of selected household representatives, who replicates the demographics in Britain, would be re-interviewed face-to-face or through the internet (Jäckle et al., 2019). As experiment modules would change in each wave, only Wave 7 (IP7) in 2014 includes all three experiments that are related to the research purposes. Among the total of 2,826 participants aged 16 or above in IP7, only 495 of them had participated in all the relevant experiment modules.
Methodology
The questionnaire results are categorised into two datasets. Hence, I will firstly focus on classifying data from different experiment modules in g_indresp_ip, and converting them into meaningful variables with Python. Then I will identify those demographics information in g_indall_ip that would be of value. Once all variables are created, I will combine them into one dataframe before running the Ordered regression model in R.
The following packages are imported to set up the Python environment:
In [1]:
# data manipulation
import pandas as pd
import numpy as np
import re
import functools
from sklearn.preprocessing import StandardScaler
# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Data Description
To avoid confusion, I have only downloaded results from experiment modules that would be of use in a .xls file. The file is then imported as a Python Pandas dataframe with the first column "pidp", which is uniquely representing each participant as the index.
In [3]:
# create a panda dataframe by reading the csv file with the first column as index
g_indresp = pd.read_excel("g_indresp_ip.xls", index_col = 0)
As shown below, the variable series g_choice are results of the "Time and Risk Preference" module to determine interviewees' time preference and time discount rate; whereas the g_envtax series is from the "Environmental Tax" module to check if the interviewee would support environmental tax; lastly, the g_envhabit is for controlling the interviewees' attitude on environmental behaviour from the "Environmental Behaviour and Attitudes" module.
In [4]:
# print column values
print(g_indresp.columns.values)
Out [4]:
['g_choice01', 'g_choice02', 'g_choice03', 'g_choice04', 'g_choice05',
'g_choice06', 'g_choice07', 'g_choice08', 'g_choice09', 'g_choice10',
'g_choice11', 'g_choice12', 'g_choice13', 'g_choice14', 'g_choice15',
'g_choice16', 'g_choice17', 'g_choice18', 'g_choice19', 'g_choice20',
'g_choice21', 'g_choice22', 'g_choice23', 'g_choice24', 'g_choice25',
'g_choice26', 'g_choice27', 'g_choice28', 'g_choice29', 'g_choice30',
'g_choice31', 'g_choice32', 'g_choice33', 'g_choice34', 'g_choice35',
'g_choice36', 'g_choice37', 'g_choice38', 'g_choice39', 'g_choice40',
'g_choice41', 'g_choice42', 'g_choice43', 'g_choice44', 'g_choice45',
'g_choice46', 'g_choice47', 'g_choice48', 'g_choice49', 'g_choice50',
'g_choice51', 'g_choice52', 'g_choice53', 'g_choice54', 'g_choice55',
'g_choice56', 'g_choice57', 'g_choice58', 'g_choice59', 'g_choice60',
'g_choice61', 'g_choice62', 'g_choice63', 'g_choice64', 'g_choice65',
'g_choice66', 'g_choice67', 'g_choice68', 'g_choice69', 'g_choice70',
'g_choice71', 'g_choice72', 'g_envtax_a', 'g_envtax_b', 'g_envtax_c',
'g_envtax_d', 'g_envtax_e', 'g_envtax_f', 'g_envtax_g', 'g_envtax_h',
'g_envhabit1', 'g_envhabit2', 'g_envhabit3', 'g_envhabit4', 'g_envhabit5',
'g_envhabit6', 'g_envhabit7', 'g_envhabit8', 'g_envhabit9']
Dependent Variable - Environmental Taxation
In the “Environmental Tax” module, respondents were randomly assigned with one of eight questions, consisted with four different settings, each with or without an additional common treatment (i.e. 4 settings x (with + without treatment) = 8 Questions from a to h). While the treatment is going to be intuitive, the four question settings were unrelated to time discounting. They will be disregarded as if all subjects were asked by the baseline question:
How willing would you be to pay higher taxes in order to protect the environment?
Hence, a new dataframe envtax_table is created by extracting those related columns from g_indresp.
From the 10 samples below, we can see once respondents were assigned with one of the eight questions, they would be choosing their degree of Willingness To Pay environmental tax, while the rest of the seven questions would be marked as "inapplicable":
In [5]:
# filter columns with name containing "g_envtax_" to create a new dataframe
envtax_table = g_indresp.filter(regex = 'g_envtax_', axis = 1)
# take ten random samples from envtax
envtax_table.sample(10)
Out [5]:
g_envtax_a g_envtax_b g_envtax_c g_envtax_d g_envtax_e g_envtax_f g_envtax_g g_envtax_h
pidp
1700103367 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not at all inapplicable
willing
1700425007 inapplicable inapplicable inapplicable inapplicable not at all inapplicable inapplicable inapplicable
willing
1700950651 not very inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
willing
1701061487 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1726302446 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not very inapplicable
willing
1700792891 proxy proxy proxy proxy proxy proxy proxy proxy
1732483652 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700369251 inapplicable inapplicable inapplicable not at all inapplicable inapplicable inapplicable inapplicable
willing
1700503207 inapplicable very willing inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700180211 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not very willing
By creating a pivot table, the aggregate number of answers in each question are counted. The majority are choosing "Fairly Willing", followed by "Not Very Willing" in most questions. As expected, positive answers spiked when participants were receiving the common treatment in question b, d, f, h, which will be controlled in the next session.
In [6]:
# create a new empty panda dataframe
pivot_tax = pd.DataFrame()
# for each column in envtax_table, create a pivot table and append it to the newly created dataframe
for col in envtax_table:
pivot = envtax_table.pivot_table(columns=col, aggfunc="size",fill_value=0)
pivot_tax[col] = pivot
# print out the dataframe to count the unique value in each column
print(pivot_tax)
Out [6]:
g_envtax_a g_envtax_b g_envtax_c g_envtax_d \
g_envtax_a
don't know 5 10 4 15
fairly willing 77 98 69 97
inapplicable 2098 2105 2133 2099
not at all willing 68 15 49 30
not very willing 77 41 71 46
proxy 76 76 76 76
very willing 12 67 8 49
g_envtax_e g_envtax_f g_envtax_g g_envtax_h
g_envtax_a
don't know 5 10 2 14
fairly willing 80 84 93 96
inapplicable 2115 2104 2092 2105
not at all willing 58 50 60 21
not very willing 63 53 72 35
proxy 76 76 76 76
very willing 13 32 14 66
With answers from different questions, it will be the most ideal to consider each independently if data was sufficient. However, for preserving the statistical power given our small sample size, I am going to disregard the differences in settings, which are unrelated to time discounting, and treat all questions as if they were the baseline with (b) or without treatment (a). Although framing effects are inevitable, the experiment designer, Fairbrother (2017) had proved such adverse effect is minimal in his publication. Hence, I am going to consolidate the questions into a new variable envtax.
In [7]:
# define a function to check in the inputted row from the first column to the last,
# when would the cell value deviate from "inapplicable" so to return that value
def to_the_right(col):
value = "inapplicable"
col_index = 0
while value == "inapplicable" and col_index < 8:
value = col[col_index]
col_index += 1
return value
# define a function to apply to_the_right() to every row in the inputted dataframe
def clean(df):
df.loc[:,"envtax"] = df.apply(lambda col: to_the_right(col), axis = 1)
# run clean() on envtax_table
clean(envtax_table)
# take ten random samples from envtax_table
envtax_table.sample(10)
Out [7]:
g_envtax_a g_envtax_b g_envtax_c g_envtax_d g_envtax_e g_envtax_f g_envtax_g g_envtax_h envtax
pidp
1700910531 inapplicable very willing inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable very willing
1701568775 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not very not very
willing willing
1700711971 inapplicable inapplicable inapplicable inapplicable inapplicable not at all inapplicable inapplicable not at all
willing willing
1720998486 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not at all not at all
willing willing
1726717246 inapplicable inapplicable not at all inapplicable inapplicable inapplicable inapplicable inapplicable not at all
willing willing
1700819475 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable fairly willing inapplicable fairly willing
1700442023 not very inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable not very
willing willing
1701483087 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable fairly willing fairly willing
1700559651 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1721290886 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
Unfortunately, not every interviewees had joint the "Environmental Tax" module. Absentees or participants with inaudible answer, e.g. "Don't know", would be dropped from envtax_table.
In [8]:
# drop rows if the value in envtax is not recognise in the list
envtax_table.drop(envtax_table[~envtax_table.envtax.isin(["not at all willing", "not very willing", "fairly willing", "very willing"])].index , inplace=True)
# call out the pivot table to count the unique value in the column
envtax_table.pivot_table(columns="envtax", aggfunc="size",fill_value=0)
Out [8]:
envtax
fairly willing 693
not at all willing 351
not very willing 458
very willing 261
dtype: int64
The final step before finalising the dependent variable, envtax is to transform the categorical text data into categorical numerical data as below:
- Very willing (= 4)
- Fairly Willing (= 3)
- Not Very Willing (= 2)
- Not at All Willing (= 1)
In [9]:
# creating a dictionary to assign numerically ordered values to the categorical answers
encoding = {"envtax": {"very willing": 4, "fairly willing": 3, "not very willing": 2, "not at all willing": 1}}
# replacing the string with the corresponding numeric values
envtax_table = envtax_table.replace(encoding)
Independent Variable - Treatment
As mentioned, a random treatment was imposed in the “Environmental Tax” module which introduced an additional condition at the end of the question for subjects’ Willingness To Pay environmental tax (i.e. b, d, f, h):
…, if the government reduced other taxes you pay by the same amount?For simplicity, treatment will be seen as a direct reduction on proximate financial costs, which as mentioned, has motivated more participants to be more willing to pay.
A dummy variable treatment is created with value = 1 for subjects who had received the treatment in question b, d, f, h; and = 0 otherwise.
In [10]:
# define a funtion to count in the inputted row,
# how many columns from left to right have value "inapplicable" until a deviation is detected
# if the number of counts is equal to 2 or 4 or 6 or 8, assign 1 to the new variable, otherwise, assign 0
def treat(col):
value = "inapplicable"
count = 0
col_index = 0
while value == "inapplicable" and col_index < 9:
value = col[col_index]
count += 1
col_index += 1
if count == 2 or count == 4 or count == 6 or count == 8:
treatment = 1
elif count == 9:
treatment = ""
else:
treatment = 0
return treatment
# define a function to apply treat to every row in the inputted dataframe
def treatment(df):
df.loc[:, "treatment"] = df.apply(lambda col: treat(col), axis = 1)
# run treatment() on evtax_table
treatment(envtax_table)
# take ten random samples from envtax_table
envtax_table.sample(10)
Out [10]:
g_envtax_a g_envtax_b g_envtax_c g_envtax_d g_envtax_e g_envtax_f g_envtax_g g_envtax_h envtax treatment
pidp
1701842807 not very inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable 2 0
willing
1719924092 inapplicable inapplicable fairly willing inapplicable inapplicable inapplicable inapplicable inapplicable 3 0
1700728967 inapplicable inapplicable inapplicable inapplicable not very inapplicable inapplicable inapplicable 2 0
willing
1726560926 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable fairly willing 3 1
1719924132 inapplicable inapplicable fairly willing inapplicable inapplicable inapplicable inapplicable inapplicable 3 0
1700894891 inapplicable very willing inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable 4 1
1713668132 inapplicable inapplicable inapplicable inapplicable not at all inapplicable inapplicable inapplicable 1 0
willing
1721596886 inapplicable inapplicable inapplicable inapplicable not at all inapplicable inapplicable inapplicable 1 0
willing
1700499127 inapplicable inapplicable inapplicable fairly willing inapplicable inapplicable inapplicable inapplicable 3 1
1717496452 inapplicable inapplicable inapplicable inapplicable not at all inapplicable inapplicable inapplicable 1 0
willing
Independent Variable - Time Discounting
In the “Time Preference and Risk Aversion” module respondents were asked 6 sets of 12 questions based on hypothetical financial rewards with 10% chance of receiving an actual payment for one of their choices to ensure genuine participation. Check Appendix B for full set of questions.
In all 72 questions, respondents were asked to choose either receiving a Sooner & Smaller reward (hereafter SS) or a Later & Larger reward (hereafter LL) with 6 different timeframes in the form of i_j, specified in each set. For i defines when would SS be received. Either in ‘near’ term (i.e. receive SS today: i = 0; grey line in illustration), or in ‘distant’ term (i.e. receive SS with 1-month front end delay i = 1; dotted line in illustration); for j indicates the lag between SS and LL. To receive LL j month(s) after SS, where j = 1, 3 or 12; line span in illustration.
The difference between Smaller and Larger reward are interpreted as compensation for waiting. (i.e. Range of compensation offered varies in proportion to lag length j as shown in Appendix B.) The percentage difference in reward will be the compensation rate (hereafter comp_rate) offered in each question. For example, for Choice01 in 1_1,
Between Option (A): Receiving £100 in 1 month and Option (B): Receiving £100.42 in 2 months, which option do you prefer?if an individual preferred SS, comp_rate of 0.42% would be insufficient, and a higher comp_rate should be offered to compensate that individual for waiting 1 more month.
To deduce an individual’s valuation on waiting in each timeframe, however, the titration approach was adopted (Admur et al., 2015). Within the same set of 12 questions, the amount of LL offered would increase marginally while keeping SS and i_j unchanged. As comp_rate increases, respondents were expected to switch from choosing SS to LL eventually (Kirby et al., 1999).
Hence, if an individual chose SS for Choice01 (comp_rate of 0.42% for waiting 1 month) but preferred LL for Choice02 (comp_rate of 0.83% for waiting 1 month), there should be a comp_rate between 0.42% to 0.83% that makes the individual indifferent between SS and LL. For computation, I will speculate the preferred comp_rate for each i_j at around the mid-point. (e.g. 0.83%/2) Intuitively, comp_rate represents the minimum compensation rate or the time discount rate required to convince that individual to wait for a specific timeframe.
Six dataframes periodi_j were created to correspond with each timeframe.
In [11]:
# filter the columns in g_indresp by RegEx to create six new dataframes
period1_2 = g_indresp.filter(regex = 'g_choice0[1-9]|g_choice1[0-2]', axis = 1)
period1_4 = g_indresp.filter(regex = 'g_choice1[3-9]|g_choice2[0-4]', axis = 1)
period1_13 = g_indresp.filter(regex = 'g_choice2[5-9]|g_choice3[0-6]', axis = 1)
period0_1 = g_indresp.filter(regex = 'g_choice3[7-9]|4[0-8]', axis = 1)
period0_3 = g_indresp.filter(regex = 'g_choice49|g_choice5[0-9]|g_choice60', axis = 1)
period0_12 = g_indresp.filter(regex = 'g_choice6[1-9]|g_choice7[0-2]', axis = 1)
# create a list referring to the six new dataframes
period = [period1_2, period1_4, period1_13, period0_1, period0_3, period0_12]
# take ten random samples from one of the six dataframes
period0_1.sample(10)
Out [11]:
g_choice37 g_choice38 g_choice39 g_choice40 g_choice41 g_choice42 g_choice43 g_choice44 g_choice45 g_choice46 g_choice47 g_choice48
pidp
1713817692 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700500491 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700141455 proxy proxy proxy proxy proxy proxy proxy proxy proxy proxy proxy proxy
1720651652 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700360411 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1701575567 option a: option a: option a: option b: option b: option b: option b: option b: option b: option b: option b: option b:
receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving
�100 today �100 today �100 today �101.67 in �102.08 in �102.50 in �103.33 in �104.17 in �105 in �106.67 in �108.33 in �112.50 in
1 month 1 month 1 month 1 month 1 month 1 month 1 month 1 month 1 month
1701491247 option a: option a: option a: option a: option a: option a: option a: option a: option b: option b: option b: option b:
receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving receiving
�100 today �100 today �100 today �100 today �100 today �100 today �100 today �100 today �105 in �106.67 in �108.33 in �112.50 in
1 month 1 month 1 month 1 month
1702475366 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1721923246 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
1700544691 inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable inapplicable
As shown in the sampling above, some interviewees did not take part in this experiment module, hence, dropping them from the data:
In [12]:
# identified the rows with inaudible values in g_choice01
drop_index_choice = g_indresp[g_indresp["g_choice01"].str.contains("inapplicable|proxy")].index
# drop the identified rows by index in each six dataframes
for df in period:
df.drop(drop_index_choice, inplace = True)
Before I deduce the compensation rate, I have to transform all cells' value from a string e.g. option A: receiving �100 today, into a numeric value 100. Then I will calculate the percentage difference between the first column with the second, the second column with the third, etc. until the percentage difference is higher than 0, so to halve that value and return a new variable comp_rate.
In [13]:
# define a function to loop across every rows and columns in a dataframe to extract and replace with numeric values in the cell
def extract(df):
for col in range(df.shape[1]):
for row in range(df.shape[0]):
cell = df.iloc[row, col]
number = re.findall(r"-?\d+\.?\d*", cell)
df.iat[row, col] = number[0]
# define a function, comp_cal(), that apply the rate formula on the first and second column, the second and third column, etc. until rate is no longer 0
def comp_cal(col):
comp = 0
col_index = 0
rate = (lambda now, later: (later - now)/(now*2))
while comp == 0 and col_index < 11:
comp = rate(float(col[col_index]), float(col[col_index + 1]))
col_index += 1
return comp
# define a function to apply the comp_cal() to each row in a dataframe
def compensation(df):
df.loc[:, "comp_rate"] = df.apply(lambda col: comp_cal(col), axis = 1)
# apply extract() and compensation() to each dataframe in period
for df in period:
extract(df)
compensation(df)
# take ten random samples from one of the six dataframes
period0_1.sample(10)
Out [13]:
g_choice37 g_choice38 g_choice39 g_choice40 g_choice41 g_choice42 g_choice43 g_choice44 g_choice45 g_choice46 g_choice47 g_choice48 comp_rate
pidp
1700163887 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
1701108411 100 100 100 100 100 100 100 100 105 106.67 108.33 112.50 0.0250
1701429367 100 100 100 100 100 100 100 104.17 105 106.67 108.33 112.50 0.0209
1701672807 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
1701704771 100 100 100 100 100 100 100 104.17 105 106.67 108.33 112.50 0.0209
1700476007 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
1700864967 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
1701575567 100 100 100 101.67 102.08 102.50 103.33 104.17 105 106.67 108.33 112.50 0.0084
1723181246 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
1700096567 100 100 100 100 100 100 100 100 100 100 100 100 0.0000
But for respondents who persisted in SS throughout, the above mechanism would be inapplicable as their compensation required lay above the highest offered in the 12th question in the set. Hence, those with discount rate exceeded the upper bound of the experimental design, would have comp_rate estimated at the upper bound.
In [14]:
# create lists referring to the corresponding periods dataframes with different lag length
period1 = [period0_1, period1_2]
period3 = [period0_3, period1_4]
period12 = [period0_12, period1_13]
# replace the comp_rate for those respondents who persisted in SS with the upper bound
for df in period1:
df.comp_rate.replace(0,0.1250, inplace = True)
for df in period3:
df.comp_rate.replace(0,0.3750, inplace = True)
for df in period12:
df.comp_rate.replace(0,1.5000, inplace = True)
# take five random samples from one of the six dataframes
period0_1.sample(5)
Out [14]:
g_choice37 g_choice38 g_choice39 g_choice40 g_choice41 g_choice42 g_choice43 g_choice44 g_choice45 g_choice46 g_choice47 g_choice48 comp_rate
pidp
1722888846 100 100 100 100 100 100 103.33 104.17 105 106.67 108.33 112.50 0.01665
1700532447 100 100 100 100 100 100 100 100 100 100 100 100 0.1250
1700061891 100 100 100 100 100 100 100 100 100 100 100 100 0.1250
1721046046 100 100 100 100 100 100 100 100 100 100 100 100 0.1250
1700577331 100 100 100 100 100 100 100 100 100 100 100 112.50 0.0625
After deducing comp_rate in each period, I will consolidate them into a new dataframe, comp, as below:
In [15]:
# create a new list with items to be referenced as names of the new columns
comp_list = ["comp1_2", "comp1_4", "comp1_13", "comp0_1", "comp0_3", "comp0_12"]
# create an empty dictionary for later
comp_dic = {}
# by zipping the two reference lists, comp_list and period
# items in the two lists are looped to fill in the new dictionary
for x, y in list(zip(comp_list, period)):
comp_dic[x] = y.loc[:,"comp_rate"]
# use the dictionary to create a new dataframe, comp
comp = pd.DataFrame(comp_dic)
# take five random samples from comp
comp.sample(5)
Out [15]:
comp1_2 comp1_4 comp1_13 comp0_1 comp0_3 comp0_12
pidp
1700325727 0.006250 0.050000 0.15000 0.004150 0.025000 0.10000
1700855447 0.008350 0.031250 0.10000 0.016650 0.025000 0.05000
1700638527 0.002041 0.006173 0.02381 0.002041 0.006173 0.02381
1723541646 0.020850 0.006173 0.02381 0.062500 0.075000 0.20000
1721046046 0.125000 0.375000 1.50000 0.125000 0.375000 1.50000
The sample mean E(.) and sample standard deviation σ(.) as shown:
In [16]:
# call the aggregated mean and standard deviation of each column
comp.agg(['mean','std'])
Out [16]:
comp1_2 comp1_4 comp1_13 comp0_1 comp0_3 comp0_12
mean 0.062949 0.170676 0.540580 0.069469 0.169673 0.468039
std 0.053565 0.157790 0.580647 0.051519 0.153347 0.557025
To test H1, it would be impractical to separately regress the six comp_rate from different timeframes against the dependent variable. Therefore, a continuous independent variable, standardised compensation rate (std_comp) is created. By obtaining subjects’ standard score in each comp_rate, I compare it with the sample mean in that timeframe. Next, take a simple average of all six standard scores to construct an i_j averaging proxy which should reflect the general compensation rate of a subject relative to the sample (Ikeda, et al. 2010). $$std\_comp = \frac{1}{6}\sum_{i=0,1}\sum_{j=1,3,12} \frac{comp_{i\_j}-E(comp_{i\_j})}{\sigma(comp_{i\_j})}$$
In [17]:
# standardised the value in each column
for col in comp_list:
comp[col] = StandardScaler().fit_transform(comp[[col]])
# take a simple average and append the value in a new column
comp["std_comp"] = comp.iloc[:, 0:6].sum(1)/6
Out [17]:
comp1_2 comp1_4 comp1_13 comp0_1 comp0_3 comp0_12 std_comp
pidp
1701061487 -0.008391 -0.448343 1.653927 1.078909 1.340265 1.854421 0.911798
1700740531 -0.008391 -0.765526 -0.802604 -0.135404 -0.781161 -0.751210 -0.540716
1701289967 -1.138181 -1.043551 -0.890851 -1.310053 -1.067241 -0.798274 -1.041358
1722916086 -0.709154 -0.765526 -0.500925 -0.863991 -0.781161 -0.661361 -0.713686
1725282526 1.159547 1.296167 1.653927 -1.310053 -1.067241 -0.798274 0.155679
Independent Variable - Time Preference
Subjects’ time preference are categorised by comparing their timeframe specific comp_rate in ‘near’ term (i.e. i = 0) and in ‘distant’ term (i.e. i = 1) of the same lag j.
Firstly, if an individual required more compensation for postponing a reward today than postponing a reward with a front-end delay of the same lag (i.e. comp0_j > comp1_j), that individual would be subjected to "present-bias". Meaning the subject would give higher priority to events closer in time.
Secondly, if comp0_j < comp1_j, the subject would be categorised as future-biased whose time discount rate increases with time and would put higher weight on distant events. Takeuchi (2011) found the majority of his observations were "future-biased", represented by an inverse shaped discount function.
Thirdly, if comp0_j = comp1_j, , the subject would value near and distant events with a time "consistent" preference, hence, required the same compensation for postponing rewards today or postponing rewards with a front-end delay if the lag length were the same.
Lastly, in cases when respondents had persistently preferred SS in any set, with comp_rate assumed at upper bound, their corresponding time preference would be unidentifiable, namely “unidentifiable".
In [18]:
# create another dataframe, time, with comp_dic in In [16]
time = pd.DataFrame(comp_dic)
# create a new variable for j = 1 and assign value with the logical arguments presented above
time["time_1"] = np.where(time["comp1_2"] < time["comp0_1"], "pre_bi", np.where(time["comp1_2"] > time["comp0_1"], "fut_bi", np.where((time["comp1_2"] == time["comp0_1"]) & (time["comp0_1"] == 0.125), "unidentifiable", "con")))
# create a new variable for j = 3 and assign value with the logical arguments presented above
time["time_3"] = np.where(time["comp1_4"] < time["comp0_3"], "pre_bi", np.where(time["comp1_4"] > time["comp0_3"], "fut_bi", np.where((time["comp1_4"] == time["comp0_3"]) & (time["comp0_3"] == 0.375), "unidentifiable", "con")))
# create a new variable for j = 12 and assign value with the logical arguments presented above
time["time_12"] = np.where(time["comp1_13"] < time["comp0_12"], "pre_bi", np.where(time["comp1_13"] > time["comp0_12"], "fut_bi", np.where((time["comp1_13"] == time["comp0_12"]) & (time["comp0_12"] == 1.5), "unidentifiable", "con")))
# take five random samples from time
time.sample(5)
Out [18]:
comp1_2 comp1_4 comp1_13 comp0_1 comp0_3 comp0_12 time_1 time_3 time_12
pidp
1700361771 0.012500 0.025000 0.125 0.062500 0.0500 0.075 pre_bi pre_bi fut_bi
1701527967 0.002041 0.012500 0.050 0.002041 0.0125 0.050 con con con
1700399847 0.002041 0.006173 0.125 -0.002091 0.1875 0.250 fut_bi pre_bi pre_bi
1726615246 0.025000 0.050000 0.125 0.033350 0.0625 0.150 pre_bi pre_bi pre_bi
1700238007 -0.002091 0.375000 1.500 0.125000 0.3750 1.500 pre_bi unidentifiable unidentifiable
After applying the logical arguments to identify participants'time preferences in each lag length j, only the relevant columns in the dataframe, time, would be left for simplicity:
In [19]:
# create an empty column, time_pre, for later
time["time_pre"] = ""
# select specific columns from the dataframe
time = time[["time_1", "time_3", "time_12", "time_pre"]]
# take another five random samples from the updated time dataframe
time.sample(5)
Out [19]:
time_1 time_3 time_12 time_pre
pidp
1723779646 pre_bi unidentifiable unidentifiable
1700776567 fut_bi fut_bi con
1700001367 fut_bi con con
1726771646 pre_bi fut_bi fut_bi
1700243447 fut_bi fut_bi fut_bi
Accordingly, all respondents would have three time preference labels for each lag length j of either "present-bias", "futture-bias", "consistent", "unidentifiable". The count by category is summarised in the pivot table below:
In [20]:
# create a pivot table that count the occurence of each value divided by time_1, time_3 and time_12
table = time.pivot_table(index = ['time_1'], columns=['time_3','time_12'], aggfunc='count')
# call out the pivot table
table
Out [20]:
time_pre
time_3 con fut_bi pre_bi unidentifiable
time_12 con fut_bi pre_bi con fut_bi pre_bi unidentifiable con fut_bi pre_bi unidentifiable con fut_bi pre_bi unidentifiable
time_1
con 30.0 14.0 4.0 5.0 15.0 6.0 1.0 6.0 6.0 2.0 1.0 1.0 1.0 1.0 NaN
fut_bi 8.0 11.0 NaN 7.0 36.0 5.0 1.0 2.0 12.0 6.0 1.0 1.0 2.0 2.0 2.0
pre_bi 6.0 14.0 7.0 5.0 25.0 3.0 NaN 11.0 35.0 31.0 7.0 2.0 4.0 3.0 11.0
unidentifiable 3.0 5.0 2.0 5.0 19.0 3.0 NaN 6.0 4.0 3.0 3.0 15.0 22.0 13.0 62.0
Although ideally, the three labels should be coherent to demonstrate a comprehensible general time preference, deviation were observed in most subjects. As empirical researches rarely evaluate time preferences with three different lag lengths, regrettably, original assumptions have to be made to relax the grouping criteria.
- For the weakly coherent scenario, if any two of the three labels are the same, the majority would be adopted as the general time preference.
- For the weakly identifiable scenario, if any two labels are "unidentifiable", the remaining identifiable one will be deemed as the general time preference.
In [21]:
# create a new column, time_pre, and loop through all rows to fill the cell value defined by the criteria above
pre = ["con", "pre_bi", "fut_bi"]
for pre in pre:
for i in time.index:
if time.loc[i, "time_1"] == pre and time.loc[i, "time_3"] == pre:
time.loc[i, "time_pre"] = pre
elif time.loc[i, "time_1"] == pre and time.loc[i, "time_12"] == pre:
time.loc[i, "time_pre"] = pre
elif time.loc[i, "time_3"] == pre and time.loc[i, "time_12"] == pre:
time.loc[i, "time_pre"] = pre
elif time.loc[i, "time_1"] == pre and time.loc[i, "time_3"] == "unidentifiable" and time.loc[i, "time_12"] == "unidentifiable":
time.loc[i, "time_pre"] = pre
elif time.loc[i, "time_1"] == "unidentifiable" and time.loc[i, "time_3"] == "unidentifiable" and time.loc[i, "time_12"] == pre:
time.loc[i, "time_pre"] = pre
elif time.loc[i, "time_1"] == "unidentifiable" and time.loc[i, "time_3"] == pre and time.loc[i, "time_12"] == "unidentifiable":
time.loc[i, "time_pre"] = pre
else:
continue
# assign the value "unidentiable" to those empty cells
for i in time.index:
if time.loc[i, "time_pre"] == "":
time.loc[i, "time_pre"] = "unidentifiable"
# take another five random samples from the updated time dataframe
time.sample(5)
Out [21]:
time_1 time_3 time_12 time_pre
pidp
1700992127 pre_bi unidentifiable unidentifiable pre_bi
1700914611 unidentifiable unidentifiable unidentifiable unidentifiable
1701338943 unidentifiable con con con
1700157091 pre_bi pre_bi fut_bi pre_bi
1701673495 fut_bi fut_bi con fut_bi
After deducing a time preference for each subject, dummy variables (i.e. with value = 1 if the subject fits the category; otherwise value = 0) are created for later regression. Observations with "unidentifiable" time preference will be dropped.
In [22]:
# break down the column, time_pre, in which categorical values are used to create new columns with dummy values
time = pd.get_dummies(time, prefix = "t", prefix_sep = "_", columns = ["time_pre"])
# drop the rows with unidentifiable time_pre
time.drop( time[(time["t_unidentifiable"] == 1) ].index , inplace=True)
# take another five random samples from the updated time dataframe
time.sample(5)
Out [22]:
time_1 time_3 time_12 t_con t_fut_bi t_pre_bi t_unidentifiable
pidp
1725792446 pre_bi pre_bi fut_bi 0 0 1 0
1700582767 unidentifiable fut_bi fut_bi 0 1 0 0
1723262966 unidentifiable unidentifiable con 1 0 0 0
1724670486 con pre_bi pre_bi 0 0 1 0
1700238687 unidentifiable unidentifiable pre_bi 0 0 1 0
Independent Variable - Environmental Morale
It is believed that the higher an individual’s environmental morale, the more often environmentally friendly behaviour would be conducted, and the more likely environmental tax would be supported by the individual. Hence, the “Environmental Behaviour and Attitude” module is used to measure if subjects would derive intrinsic values from protecting the environment.
In [23]:
# filter the columns in g_indresp by RegEx to create a new dataframe
envhabit = g_indresp.filter(regex = 'g_envhabit[2,4-9]', axis = 1)
# drop the rows with unidentifiable envhabit
envhabit.drop(envhabit[~envhabit.g_envhabit2.isin(["always", "very often", "quite often", "not very often", "never", "n/a, can't do this", "don't know"])].index , inplace=True)
# take five random samples from the updated time dataframe
envhabit.sample(5)
Out [23]:
g_envhabit2 g_envhabit4 g_envhabit5 g_envhabit6 g_envhabit7 g_envhabit8 g_envhabit9
pidp
1700539927 never always never never always quite often quite often
1701822411 very often very often never very often always never never
1700975807 always very often never not very often never not very often quite often
1700870411 always quite often quite often very often always not very often never
1721365686 quite often very often not very often quite often not very often quite often never
With reference to Barile et al. (2015), a continuous independent variable, Environmental Habit Index envhabit, is created by firstly assigning ordered numeric values to subjects’ self-reported frequency of performing different pro-environment habits with “Always” = 5 to “Never” = 1, then taking the simple average of the aggregated values as follow:
In [24]:
# creating new columns reflecting the self-assigned numeric values of their corresponding answers to different environmental habits
new_col = ["envhabit_2", "envhabit_4", "envhabit_5", "envhabit_6", "envhabit_7", "envhabit_8", "envhabit_9"]
new_index = 0
for name in envhabit.columns:
score = []
for participant in envhabit[name]:
if participant == "always":
score.append(5)
elif participant == "very often":
score.append(4)
elif participant == "quite often":
score.append(3)
elif participant == "not very often":
score.append(2)
else:
score.append(1)
envhabit[new_col[new_index]] = score
new_index += 1
# taking simple average on the newly created columns with numeric values
envhabit["envhabit"] = envhabit.iloc[:, 7:14].sum(1)/7
# take another five random samples from the updated time dataframe
envhavit.sample(5)
Out [24]:
g_envhabit2 g_envhabit4 g_envhabit5 g_envhabit6 g_envhabit7 g_envhabit8 g_envhabit9 envhabit_2 envhabit_4 envhabit_5 envhabit_6 envhabit_7 envhabit_8 envhabit_9 envhabit
pidp
1720916846 very often very often quite often very often always never never 4 4 3 4 5 1 1 3.142857
1719108092 quite often quite often never don't know always quite often not very often 3 3 1 1 5 3 2 2.571429
1724725006 always very often never very often never not very often never 5 4 1 4 1 2 1 2.571429
1700861567 always quite often never never very often very often very often 5 3 1 1 4 4 4 3.142857
1701103651 always not very often not very often not very often quite often always always 5 2 2 2 3 5 5 3.428571
Demographics
Another dataset, g_indall, containing background information of each participant is imported to extract their corresponding age and gender.
In [25]:
# read the .xls document with the first column being the Index
g_indall = pd.read_excel("g_indall_ip.xls", index_col = 0)
# create a dataframe by filtering relevenat columns from g_indall
other = g_indall.filter(regex = 'g_agegr10_dv|g_sex', axis = 1)
# identify and count the values in g_sex and g_agegr10_dv
other.g_sex.value_counts()
other.g_agegr10_dv.value_counts()
Out [25]:
female 1870
male 1749
missing 1
don't know 1
Name: g_sex, dtype: int64
50-59 years old 548
40-49 years old 504
60-69 years old 500
10-19 years old 469
70 years or older 411
0-9 years old 396
30-39 years old 395
20-29 years old 361
Missing 37
Name: g_agegr10_dv, dtype: int64
As shown, there were some missing values which would be dropped from the observations.
In [26]:
# dropping rows by index identified with missing values
other.drop(other[other.g_sex.isin(["don't know","missing"])].index , inplace=True)
other.drop(other[other.g_agegr10_dv.isin(["don't know","missing"])].index , inplace=True)
# take five random samples from the updated dataframe
other.sample(5)
Out [26]:
g_sex g_agegr10_dv
pidp
1729110852 male 70 years or older
1716422052 female 40-49 years old
1700027891 male 60-69 years old
1726452046 female 30-39 years old
1719788132 male 0-9 years old
For later regression, two dummy variables g_sex_male and g_sex_female are created for identifying the gender of a respondent. Also, in the questionnaire, participants were answering their age with bands of 10 years. Hence, a speculation is made to assign the age value at the mid-point e.g. assigning 35 to participants answering g_agegr10_dv with "30-39 years old".
In [27]:
# creating two dummy variables from the value in g_sex with the prefix "_"
other = pd.get_dummies(other, prefix_sep = "_", columns = ["g_sex"])
# creating a dictionary and replace the keys with their corresponding values
encoding = {"g_agegr10_dv": {"0-9 years old": 5, "10-19 years old": 15, "20-29 years old": 25, "30-39 years old": 35, "40-49 years old": 45, "50-59 years old": 55, "60-69 years old": 65, "70 years or older": 75}}
other = other.replace(encoding)
# take another five random samples from the updated dataframe
other.sample(5)
Out [27]:
g_agegr10_dv g_sex_female g_sex_male
pidp
1701245099 5 0 1
1700353607 35 1 0
1701077127 65 0 1
1721936886 65 1 0
1731538492 75 1 0
After extracting the relevant columns into separate datasets and transformed with audible numeric values, they are loaded into a new dataframe, rae.
In [28]:
# creating a list of names of the dataframes created thus far
data_frames = [envtax_table, comp, time, envhabit, other]
# inner joining the datafames with names from the list by the common index value, pidp
df = functools.reduce(lambda x, y: pd.merge(x, y, on = "pidp", how = "inner"), data_frames)
# list out the columns in the new consolidated dataframe
df.columns
Out [28]:
Index(['g_envtax_a', 'g_envtax_b', 'g_envtax_c', 'g_envtax_d', 'g_envtax_e',
'g_envtax_f', 'g_envtax_g', 'g_envtax_h', 'envtax', 'treatment',
'comp1_2', 'comp1_4', 'comp1_13', 'comp0_1', 'comp0_3', 'comp0_12',
'std_comp', 'time_1', 'time_3', 'time_12', 't_con', 't_fut_bi',
't_pre_bi', 't_unidentifiable', 'g_envhabit2', 'g_envhabit4',
'g_envhabit5', 'g_envhabit6', 'g_envhabit7', 'g_envhabit8',
'g_envhabit9', 'envhabit_2', 'envhabit_4', 'envhabit_5', 'envhabit_6',
'envhabit_7', 'envhabit_8', 'envhabit_9', 'envhabit', 'g_agegr10_dv',
'g_sex_female', 'g_sex_male'],
dtype='object')
From the list of columns above, select those that would be use in the regression.
In [29]:
# create a new dataframe with columns from df
rae = df[["envtax","treatment", "std_comp", "t_fut_bi", "t_pre_bi", "envhabit", "g_agegr10_dv", "g_sex_male"]]
# take five random samples from the newly created dataframe
rae.sample(5)
Out [29]:
envtax treatment std_comp t_fut_bi t_pre_bi envhabit g_agegr10_dv g_sex_male
pidp
1700214211 1 0 -0.276368 0 1 2.428571 55 1
1724119646 4 1 -0.110438 1 0 3.428571 75 1
1723310446 1 1 -0.377052 1 0 2.285714 65 1
1701727207 3 1 -0.177028 1 0 2.285714 45 0
1706126962 4 1 -0.996040 0 0 3.571429 25 0
After the extracting and transforming with Python, the dataset would be exported as a separate .csv file for later interpretation.
In [30]:
# exporting rae as a .csv file
rae.to_csv(r"C:\Users\lauwi\Desktop\Data\Python\RAE\RAE.csv", index = True)
3. Visualisation & Summary
But before we begin the regression in R, we are going to stick with Python to understand the data.
In [31]:
# calling out summary statistics of the consolidated dataset
rae.describe()
Out [31]:
envtax treatment std_comp t_fut_bi t_pre_bi envhabit g_agegr10_dv g_sex_male
count 304.000000 304.000000 304.000000 304.000000 304.000000 304.000000 304.000000 304.000000
mean 2.526316 0.496711 -0.213711 0.404605 0.355263 2.968985 52.993421 0.447368
std 0.950576 0.500814 0.685276 0.491625 0.479382 0.600491 17.107309 0.498042
min 1.000000 0.000000 -1.041358 0.000000 0.000000 1.000000 15.000000 0.000000
25% 2.000000 0.000000 -0.827215 0.000000 0.000000 2.571429 45.000000 0.000000
50% 3.000000 0.000000 -0.358651 0.000000 0.000000 3.000000 55.000000 0.000000
75% 3.000000 1.000000 0.302730 1.000000 1.000000 3.321429 65.000000 1.000000
max 4.000000 1.000000 1.181721 1.000000 1.000000 4.857143 75.000000 1.000000
As described from the table above, only 304 observations left after dropping those missing or inaudible entries. The average age of the remaining samples is 53 years old with 45% being male. Moreover, exactly half of the observations had received the treatment in the "Environment Tax" module which was meant to be randomly assigned to interviewees. This might be an indication that my active filtering does not alter much of the sample population distribution.
In [32]:
envtax = ["Very Willing\n(=4)", "Fairly Willing\n(=3)", "Not Very Willing\n(=2)", "Not At All Willing\n(=1)"]
xpos = np.arange(len(envtax))
plt.bar(xpos-0.2, rae.envtax.value_counts().sort_index(ascending=False), width=0.4, color = "#1d58a7d8", label="Overall")
plt.bar(xpos+0.2, rae[rae["treatment"] == 1].envtax.value_counts().sort_index(ascending=False), width=0.4, color = "#4aa6ca", label="Treatment")
plt.xticks(xpos, envtax, rotation = "350")
plt.ylabel("No. of Respondents")
plt.title("How willing would you be to pay higher taxes in order to protect the environment?")
plt.legend()
As shown in the diagram above, the majority of participants chose fairly willing (=3) to pay higher taxes to protect the environment regardless of settings. But with the treatment, participants are more likely to be very willing to pay (=4) and less likely to give a negative responses (=1 or =2) in line with Jagers et al. (2019).
In [33]:
plt.hist([rae.std_comp, rae[rae["t_fut_bi"] == 1].std_comp, rae[rae["t_pre_bi"] == 1].std_comp, rae[(rae["t_fut_bi"] == 0) & (rae["t_pre_bi"] == 0)].std_comp], rwidth=0.8, bins =15, color = ["#34495E","#ABB3B9","#2980B9","#4aa6ca"], label = ["overall","fut_bi","pre_bi","con"])
plt.xlabel("Standardised Compenstation Rate (std_comp)")
plt.ylabel("No. of Respondents")
plt.title("Distribution of Respondents' Deduced Compensation Rate")
plt.legend()
It is noticeable that respondents' standardised compensation rate are concentrated at the lower end, especially for those with consistent time preference whose std_comp are mostly accumulated at the lowest bound.
In [34]:
plt.hist(rae.envhabit, bins=15, histtype = "step")
plt.xlabel("Environemental Habbit Index")
plt.ylabel("No. of Respondents")
plt.title("How often do you personally do each of the following environmental habbit... \n from “Always” = 5 to “Never” = 1")
The data also demonstrated subjects' attitudes towards environmentally friendly behaviour are resembling a normal distribution. Meaning, the majority are not having strong opinion on the subject matter with a few outliners on both ends.
In [35]:
envtax = ["Very Willing\n(=4)", "Fairly Willing\n(=3)", "Not Very Willing\n(=2)", "Not At All Willing\n(=1)"]
xpos = np.arange(len(envtax))
plt.bar(xpos-0.2, rae[rae["g_sex_male"] == 1].envtax.value_counts().sort_index(ascending=False), width=0.4, color = "#85C1E9", label="Male")
plt.bar(xpos+0.2, rae[rae["g_sex_male"] == 0].envtax.value_counts().sort_index(ascending=False), width=0.4, color = "#F1948A", label="Female")
plt.xticks(xpos, envtax, rotation = "350")
plt.ylabel("No. of Respondents")
plt.title("How willing would you be to pay higher taxes in order to protect the environment?")
plt.legend()
Although female consists of 55% of the sample, it is noted that more male respondents are having stronger preference on the question (i.e. chose either =1 or =4).
In [36]:
plt.hist(rae.g_agegr10_dv, bins=len(rae.g_agegr10_dv.value_counts()), histtype = "stepfilled")
plt.xlabel("Age")
plt.ylabel("No. of Respondents")
plt.title("Distribution of Respondents' age")
It is also not surprising to see the age of participants in the experiments are relatively senior given it requires high commitment and would be suitable of retired people.
4. Results
As demonstrated in previous sessions, envtax, the dependent variable is with ordinal numeric values (i.e. envtax = 1 | 2 | 3 | 4) to represent respondents' willingness to pay environmental tax, Ordinary Least Squares (OLS) may no longer provides the best (BLU) estimate (McKelvey and Zavoina, 1975). Contrarily, ordinal regression selects the model with the highest maximum likelihood in explaining the underlining latent variable, envtax*, which is invisible and continuous to represent the discrete envtax (Winship and Mare, 1984).
However, after a throughout search in the internet, the only module in Python that support Ordinary regression, statsmodels 0.13.0.div0, is still at development stage. I have no choice but turn to the other statistical oriented programming language, R. Therefore, from now on I will be switching from Python to R for running ordered regressions to test my hypotheses in session 1.
In [37]:
# read the previously exported .csv file
csv <- read.csv("C:/Users/lauwi/Desktop/Data/Python/RAE/RAE.csv")
# check the first six rows in the dataset
head(csv)
Out [37]:
pidp envtax treatment std_comp t_fut_bi t_pre_bi envhabit g_agegr10_dv
1 1700000687 3 1 0.88 1 0 1.9 65
2 1700001367 4 0 -1.00 0 0 3.7 75
3 1700002059 3 0 1.18 0 1 2.7 25
4 1700002067 3 1 -0.52 0 1 2.1 25
5 1700015651 4 1 0.85 0 1 4.3 35
6 1700017007 1 0 1.03 0 1 3.1 75
g_sex_male
1 0
2 0
3 0
4 0
5 0
6 0
The command, polr (i.e. Proportional Odds Logistics Regression) from the MASS package will be used.
In [38]:
# install the MASS package
require(MASS)
require(foreign)
# define the dependent variable, envtax, as ordered factor variable
csv$envtax = factor (csv$envtax, levels =c("1", "2", "3", "4"), ordered = TRUE)
# fit the ordered logistics model using polr function with Hess = True to return the Hessian information matrix
o_logit <- polr(formula = envtax ~ std_comp + treatment + t_fut_bi + t_pre_bi + envhabit + g_agegr10_dv + g_sex_male, data = csv, Hess = TRUE, method = "logistic")
# call the summary statistics of the fitted ordered logistics model
summary(o_logit, digits=3)
Out [38]:
Call:
polr(formula = envtax ~ std_comp + treatment + t_fut_bi + t_pre_bi +
envhabit + g_agegr10_dv + g_sex_male, data = csv, Hess = TRUE,
method = "logistic")
Coefficients:
Value Std. Error t value
std_comp -0.2286 0.16677 -1.371
treatment 1.1910 0.22553 5.281
t_pre_bi -1.1402 0.30084 -3.790
t_fut_bi -0.4651 0.28731 -1.619
envhabit 0.5377 0.18549 2.899
g_agegr10_dv -0.0111 0.00644 -1.719
g_sex_male -0.1309 0.21891 -0.598
Intercepts:
Value Std. Error t value
1|2 -0.729 0.650 -1.122
2|3 0.626 0.646 0.969
3|4 3.021 0.672 4.495
Residual Deviance: 730.05
AIC: 750.05
From Out [38], we can see the regression output including the intercepts and the corresponding coefficient, standard error and t-value of each independent variable.
The estimated regression model for intercept "1|2", "2|3","3|4" could be written as: $$Logit(\hat P(Y\le j)) = intercept_j - (-0.23)\ast std\_comp - 1.19\ast treatment - (-0.47)\ast t\_fut\_bi $$ $$- (-1.14)\ast t\_pre\_bi - 0.54\ast envhabit - (-0.01)\ast g\_agegr10\_dv - (-0.13)\ast g\_sex\_male$$
Interpretation - Intercepts
The intercept (a.k.a. cut-point) indicates where the latent variable, envtax*, is cut to divide the thresholds for each level of envtax.
The cut-point "1|2" represents the log of odds (i.e. \(Logit(\hat P(Y\le 1))\) also as \(\log\frac{P(Y\le1)}{P(Y\gt2)}\) ) of an individual to be "Not at All Willing" to support environmental tax versus being "Not Very Willing", "Fairly Willing" or "Very Willing".
Similarly, "2|3" measures the log of odds (i.e. \(Logit(\hat P(Y\le 2))\) also as \(\log\frac{P(Y\le2)}{P(Y\gt3)}\) ) of an individual to be "Not at All Willing" or "Not Very Willing" to support environmental tax versus being "Fairly Willing" or "Very Willing".
Interpretation - Significance
By default, there is no significance testing in polr which is why I am comparing the t-value and standard error to calculate the p-value with reference to UCLA Statisitcal Consulting.
As p-value is to calculate the probability of obtaining an observation if the theory to be tested is invalid (e.g. the p-value of treatment measures how likely to see an extreme result if the treatment experiment was in fact a failure), I am going to use the significance level of 0.1, meaning I will reject the output when it has more than 10% chance of being observed purely out of randomness. For visibility, I will manually mark * in Out [39] for those rows of output with p-value less than 0.1; ** for those less than 0.05; and *** for less than 0.001.
In [39]:
# store the coefficient table
(ctable <- coef(summary(o_logit)))
# calculate and store p-value
p_logit <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
# apending p_logit into ctable corrected to 4 decimal places
(ctable <- cbind(ctable, "p value" = round(p_logit,4)))
Out [39]:
Value Std. Error t value p value
std_comp -0.2286350 0.166766341 -1.3709904 0.1704
treatment 1.1910311 0.225525819 5.2811299 0.0000 ***
t_pre_bi -1.1402099 0.300843640 -3.7900417 0.0002 ***
t_fut_bi -0.4651368 0.287312202 -1.6189248 0.1055
envhabit 0.5376822 0.185486482 2.8987677 0.0037 **
g_agegr10_dv -0.0110683 0.006438106 -1.7191863 0.0856 *
g_sex_male -0.1309155 0.218906550 -0.5980430 0.5498
1|2 -0.7289528 0.649592940 -1.1221687 0.2618
2|3 0.6256457 0.645954072 0.9685607 0.3328
3|4 3.0206391 0.672028873 4.4948056 0.0000
Interpretation - Coefficients
Since the output has been transformed with taking log odds, it would be difficult to interpret the coefficients. For instance, participants who had received treatment would increase the expected value of envtax in log odds by 1.19 but what does it mean?
So instead of drawing meanings from the magnitude, we could look into the direction of the coefficients to understand the correlation between envtax and the independent variables.
- std_comp - although the result is not statistically significant with p-value of 0.1704, the coefficient is negative which echoes our Hypothesis 1 that “individuals’ discount rate would be negatively correlated with their support for environmental taxation”.
- treatment - as expected, the coefficient is significantly positive (p-value = 0.0000) meaning interviewees selected for treatment would be more likely to support environmental tax since the proposed financial cost were to be mitigated.
- t_pre_bi - the negative coefficient has successfully proved hypothesis 2 with high significance (p-value = 0.0002) that “individuals with present-bias would be less likely to support environmental taxation than individuals with consistent time preference”.
- t_fut_bi - among the few researches on future-bias in the field, individuals with such time-preference were assumed to behave in the opposite direction than those with present-bias. However, in this study, we can see the coefficient of t_fut_bi is also negative but insignificant (p-value = 0.1055). As of now, I will call this observation inconclusive and would worth further investigation.
- envhabit - unsurprisingly, the higher an individual’s environmental morale, the more often environmentally friendly behaviour would be conducted, and the more likely environmental tax would be supported (p-value = 0.0037).
- g_agegr10_dv - not the focus of this study, but the negative coefficient (p-value = 0.0856) might indicates younger generation would treat environmental measures more seriously.
- g_sex_male - with a p-value of 0.5498, gender probably has nothing to do with individuals' decision on paying environmental tax. Even if a correlation is observed, it would be out of randomness.
Intuitively, the higher the discount rate, the lower the Net Present Value associated to future climate mitigation, and the less likely an individual would trade off present consumptions for future well-being (US Environmental Protection Agency, 2010).
In other words, subjects with present-bias would value more on approaching events such as consumption sacrificed for paying additional tax, and weight less on the distant utility like benefits of climate mitigation.
Interpretation - Visualisation
An alternative approach to understand the correlations is by visualisations.
In [40]:
# Install the R package for visualisation
install.packages("effects")
# call the R package
library(effects)
# plot the effect of t_pre_bi on envtax
plot(Effect(focal.predictors = "t_pre_bi",o_logit))
From the graph above, we can see participants with present-bias are less likely to be "Very Willing" (=4) and "Fairly Willing" (=3) to pay environmental tax; while more likely to be "Not Very Willing" (=2) and "Not At All Willing" (=1).
In [41]:
# plot the effect of std_comp on envtax
plot(Effect(focal.predictors = "std_comp",o_logit))
It is also obvious from above that individuals' preference to be "Very Willing" (=4) and "Fairly Willing" (=3) is decreasing with their rate of time discount but increasing with "Not Very Willing" (=2) and "Not At All Willing" (=1).
5. Last word
Although procrastinators (present-biased) are proved to be less likely to support environmental tax, this is a simplified version of my undergrad dissertation. The main purpose is to practice my Python and R programming so those technicality such as robustness checking the proportional odds assumption in the ordered logistics model are not presented.
If you are interested, feel free to reach out for the full version of my work (got a solid first grade btw). In that dissertation I have not only compare the output with other analysis methods like Ordered Probit regression, I have also checked the models' heteroskedasticity, parallel lines assumption, multicollinearity, goodness of fit and have justified my first hypothesis with a significant result (coefficient = -2.50; p-value = 0.012).
If you have made this far... thank you and stay tuned for more projects!
6. Appendix
Appendix A: Full Question Versions in “Environmental Tax” Module
- How willing would you be to pay higher taxes in order to protect the environment?
- How willing would you be to pay higher taxes in order to protect the environment, if the government reduced other taxes you pay by the same amount?
- How willing would you be to pay higher taxes on things you buy that pollute the environment?
- How willing would you be to pay higher taxes on things you buy that pollute the environment, if the government reduced other taxes you pay by the same amount?
- How willing would you be to pay higher taxes, if the government spent the extra money on protecting the environment?
- How willing would you be to pay higher taxes, if the government spent the extra money on protecting the environment and reduced other taxes you pay by the same amount ?
- How willing would you be to pay higher taxes, if the government promised it would spend the extra money on protecting the environment?
- How willing would you be to pay higher taxes, if the government promised it would spend the extra money on protecting the environment and reduce other taxes you pay by the same amount?