Son Preference in Indian Families: Absolute Versus Relative Wealth Effects

Sylvestre Gaudin

ONLINE SUPPLEMENTS

Supplement 1. DATA

1. Data Manipulations and Selection Issues

1.1. Data Notes

Table S1. Observations per Household: Effective Sample

1.2. Construction of the Dependent Variable and Data Selection Issues

Table S2. Distribution by wealth quintiles before and after selection on SP.

Table S3. Distribution by wealth quintiles with and without missing observations on independent variables.

Table S4. Distribution by wealth quintiles before and after removing small PSU’s.

Table S5. Distribution by wealth quintiles before and after data selection.

Table S6. Distribution of educational attainment before and after data selection.

2. Construction of GDP/c

3. Modifications from B&Z on Individual-Level Independent Variables

Supplement 2. STABILITY TESTS FOR SMALL SAMPLE PCA

Table S7. Summary Statistics on Correlations Results between Full-sized and Reduced-Sample PC Scores from 50 Independent Randomized Household Selections

Figure S1. Median Correlations between Reduced Sample and Full Sample P.C. Results

Supplement 3. ALTERNATIVE LOCAL AREA GROUPINGS (NFHS-2)

Figure S2. Grouping of Households into Local Areas based on NFHS-2 Geographical Identifiers

Table S8. Constructed Local Areas from Household Level NFHS-2 Data

Table S9. Correlation Between PSU and Local Area PC Scores by PSU Size.

Supplement 4. ADDITIONAL RESULTS

1. NFHS-3 Multilevel Estimation Results

Table S10. Multilevel linear estimation of son preference models: NFHS-3

2. Discussion of Results on Ideal Number of Children and Its Square

3. Comparison of OLS, Logit and Ordered Logit: NFHS-2 and NFHS-3 Results

Table S11. Comparison of OLS, logit, and ordered logit specifications: NFHS-2

Table S12. Comparison of OLS, logit, and ordered logit specifications: NFHS-3

4. Random-Effect Logit with State Fixed Effects. NFHS-2 Results

Table S13. Son preference models: mixed effects logit estimation for NFHS-2

5. Educational Preference Model

Table S14. Raw distribution of answers on educational preferences

Table S15. Multilevel Linear Estimation of Stated Educational Bias: NFHS-2 Sample

Supplement 5. GEOGRAPHICAL REACH OF MARRIAGES

SUPPLEMENT 1

DATA

1. DATA MANIPULATIONS AND SELECTION ISSUES

1.1. Data Notes

The NFHS has several recodes depending on whether the unit of observation is the household, the individual, or the child. The household recode includes household-level data for all sampled households, even those with no interviewable women. It is used to calculate the absolute and relative wealth scores (before selecting the sample of women that will serve as a base for the analysis). The number of households interviewed is larger than the number of women used in the analysis (92,486 households in 1998/9 and 109,041 in 2005/6). The procedure guarantees that a representative base is used to evaluate relative wealth in each PSU.
Although NFHS-3 includes men and never-married women as well as ever-married women, the sample is restricted to match the characteristics of the NFHS-2 sample. After excluding visitors (the de jure sample is used), there are 84,348 ever-married women in the 1998/9 sample and 89,189 in 2005/6, all between the ages of 15 and 49.
PSUs are normally formed of single villages, but neighboring villages are grouped when single villages are too small. There are at least 50 households in each PSU.
Although multiple interviews are allowed per household, 76 (80) percent of the NFHS-2 (NFHS-3) sample consists of single observations per household, and 95 (96) percent of the samples consist of households with no more than 2 interviewed ever-married women.
Due to missing or unusable information on key variables used in the analysis, the base data set is reduced to 77,886 women in 68,114 households in NFHS-2 and 83,785 women in 75,343 households in NFHS-3. The total number of observations used in the pooled analysis is therefore 161,671 women in 143,457 households.
Table S1 reports the number of observations per household in each sample.

Table S1. Observations per household: effective sample

	NFHS-2		NFHS-3
Observations per household	Frequency	Cumulative %	Frequency	Cumulative %
1	59,143	75.94	67,529	80.6
2	14,417	94.45	13,213	96.37
3	3,378	98.78	2,390	99.22
4	766	99.77	484	99.8
5	122	99.92	126	99.95
6	25	99.96	36	99.99
7	35	100	7	100
Total cases	77,886		83,785

1.2. Construction of the Dependent Variable and Data Selection Issues

Removal of unusable observations on ideal family questions. Non-numerical answers to the questions on ideal number of children were dropped (4,481 observations in 1998/9 and 2,083 in 2005/6). In both NFHS samples, observations dropped because of non-numerical answers are uniformly distributed among all levels of education except for women with no education (0 years). In 1998/9, women with no education account for 71 percent of the missing answers in the reduced sample, as opposed to 51 percent in the full data. In 2005/6, they account for 69.79 percent of missing answers in the reduced sample, as opposed to 40.05 percent in the full data. The same selection pattern applies across wealth quintiles in both samples, with a larger proportion of non-numerical answers in poorer households. However, the observations dropped make up only 5.31% of the data in 1998/9 and 2.34% in 2005/6, small enough to justify ignoring the selection bias.

Answers on ideal number of children. Son preference was evaluated using the difference between ideal number of boys and ideal number of girls. If no answer was given for ideal number of daughters or sons but the same answer was given for ideal family size and ideal number of either sex (total), the case was classified as zero son preference. The difference between ideal number of boys and ideal number of girls ranged from -6 to +20 in 1998/9, with most of the data between -1 and +2. In 2005/6 the range narrowed to -9 to +10 (although the outliers in 1998/9 were so few that no conclusion can be drawn from this observation alone), with most of the data still between -1 and +2.

Ideal Total. The difference was divided by the number of children in the ideal family size. Cases with an ideal family size of zero (the woman reported preferring zero sons, zero daughters, and zero of either sex) were dropped: 83 cases in 1998/9 and 895 in 2005/6. Finally, I dropped obvious outliers, reducing the maximum Ideal-Total to 10 (it goes up to 20, but ideal numbers of children larger than 10 constitute only 0.02 percent of the data).
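The construction of the dependent variable described above can be sketched as follows. This is only an illustration, not the code used for the paper: the function name, the use of None to stand for a missing or non-numerical answer, and the decision to drop cases with missing sex-specific answers when "either sex" does not equal the ideal total are assumptions made for the example.

```python
from typing import Optional

MAX_IDEAL_TOTAL = 10  # outlier cutoff stated in the text


def son_preference(boys: Optional[int], girls: Optional[int],
                   either: Optional[int], total: Optional[int]) -> Optional[float]:
    """Return SP = (ideal boys - ideal girls) / ideal total, or None if dropped.

    None inputs stand for missing or non-numerical answers (hypothetical coding).
    """
    if total is None:
        return None  # non-numerical answer on ideal family size: dropped
    if total == 0 or total > MAX_IDEAL_TOTAL:
        return None  # zero ideal family size or outlier: dropped
    if boys is None or girls is None:
        # No sex-specific answer: zero son preference only when the whole
        # ideal family is reported as "either sex". Dropping the remaining
        # cases is an assumption of this sketch.
        return 0.0 if either == total else None
    return (boys - girls) / total
```

For instance, a woman reporting two ideal sons and one ideal daughter in an ideal family of three gets a score of 1/3, while a woman reporting two children of either sex gets a score of zero.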

Check for selection bias. The distribution of the data by wealth quintiles was checked with and without missing and dropped observations on the son preference variable. Most of the dropped observations are due to non-numerical answers (Table S2).

Table S2. Distribution by wealth quintiles before and after selection on SP.

All India	NFHS-2		NFHS-3
Wealth quintile	% before	% after	% before	% after
1	15.62	15.23	11.65	11.49
2	16.99	16.63	14.7	14.54
3	19.55	19.48	19.66	19.59
4	23.09	23.30	24.47	24.54
5	24.76	25.37	29.53	29.84
Based on N =	84,348	79,731	89,189	86,176

Missing observations on independent variables. Additional observations (1,185 or 1.5 percent of observations in 1998/9 and 6,303 or 7.3 percent of observations in 2005/6) were automatically dropped because of missing observations in 4 variables used in explaining son preference: acres of cultivated land, years of education, and partner’s education. These missing observations were evenly distributed in the data in terms of son preference as well as wealth quintiles. The mean and standard deviation of SP stay at 0.13 and 0.27, respectively. The distribution of the data by wealth quintiles is compared before and after the drop (Table S3).

Table S3. Distribution by wealth quintiles with and without missing observations on independent variables.

All India	NFHS-2		NFHS-3
Wealth quintile	% before	% after	% before	% after
1	15.23	15.25	11.49	11.44
2	16.63	16.65	14.54	14.53
3	19.48	19.52	19.59	19.59
4	23.30	23.35	24.54	24.55
5	25.37	25.22	29.84	29.88
Based on N =	79,731	78,782	86,176	84,871

Households in small PSUs. Households in PSUs with a sample size below 15 were dropped (after checking the stability of PCA). The mean and standard deviation of the dependent variable remained unchanged after the drop. The distribution of the data by wealth quintiles does not change significantly either (Table S4).

Table S4. Distribution by wealth quintiles before and after removing small PSUs.

All India	NFHS-2		NFHS-3
Wealth quintile	% before	% after	% before	% after
1	15.25	15.31	11.44	11.54
2	16.65	16.72	14.53	14.54
3	19.52	19.55	19.59	19.47
4	23.35	23.33	24.55	24.51
5	25.22	25.09	29.88	29.95
Based on N =	78,782	77,886	84,871	83,785

Overall Selection. Comparison of wealth quintiles (Table S5) and education levels (Table S6) before and after data selection.

Table S5. Distribution by wealth quintiles before and after data selection

All India	NFHS-2		NFHS-3
Wealth quintile	% de jure sample	% final sample	% de jure sample	% final sample
1	15.62	15.31	11.65	11.54
2	16.99	16.72	14.7	14.54
3	19.55	19.55	19.66	19.47
4	23.09	23.33	24.47	24.51
5	24.76	25.09	29.53	29.95
Based on N =	84,348	77,886	89,189	83,785

Table S6. Distribution of educational attainment before and after data selection

	NFHS-2		NFHS-3
Educational attainment	% de jure sample	% final sample	% de jure sample	% final sample
No education	50.48	49.47	39.91	39.27
Incomplete primary	9.71	9.67	8.7	8.53
Complete primary	7.36	7.47	6.94	6.94
Incomplete secondary	16.28	16.73	31.27	31.71
Complete secondary	7.15	7.32	4.78	4.91
Higher	9.03	9.34	8.39	8.63
Based on N =	84,325	77,886	89,183	83,785

The following two sections give additional information on the construction of independent variables not essential to the comprehension of the paper.

2. CONSTRUCTION OF PER CAPITA STATE DOMESTIC PRODUCTS

The figures come from a table of Gross State Domestic Product in current prices compiled by the Central Statistical Office of the Ministry of Statistics and Programme Implementation of the Government of India; they are converted into constant 2001 prices using the CSO’s Consumer Price Index for Industrial Workers by centers (Annual Report 2006, CSO, Table 3). For states with more than one “center”, a weighted average of state centers is used. This is feasible because the CSO table provides the weight of each center within the corresponding state. A few small states do not have centers for measurement of the CPI, in which case the CPI of the largest closest state is used. Per capita calculations are made using the 1991 and 2001 state populations from the Indian Census. To give an idea of levels and to check consistency, values of constant GDP/c are converted using the 2008 dollar exchange rate. For 1999, the highest per capita GDP is USD 1,040 in the capital, Delhi, and the lowest is USD 187 in the poorest state, Bihar. In 2005, the highest is found in Goa with USD 2,338 and the lowest is still Bihar with USD 254.
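The deflation and per capita steps described above amount to simple arithmetic, sketched below. The function names and the numbers in the usage note are hypothetical and serve only to illustrate the calculation; they are not taken from the CSO tables.

```python
def state_cpi(center_cpis, center_weights):
    """Weighted average of the center-level CPI values within one state."""
    total_weight = sum(center_weights)
    return sum(c * w for c, w in zip(center_cpis, center_weights)) / total_weight


def real_gsp_per_capita(nominal_gsp, cpi, base_cpi, population):
    """Deflate nominal GSP to base-year prices, then divide by population.

    cpi is the state CPI in the year of the nominal figure; base_cpi is the
    state CPI in the base year (2001 in the text).
    """
    real_gsp = nominal_gsp * base_cpi / cpi
    return real_gsp / population
```

With two made-up centers whose CPI values are 100 and 120 and whose weights are 1 and 3, state_cpi returns 115. A made-up nominal GSP of 1,000,000 with a current CPI of 125, a base CPI of 100, and a population of 1,000 gives a constant-price per capita value of 800.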

It is important to note that state delineations were changed in 2000. The NFHS-3 sample includes 29 states instead of the 26 in NFHS-2: Jharkhand was split out of Bihar, Chhattisgarh out of Madhya Pradesh, and Uttaranchal (later Uttarakhand) out of Uttar Pradesh. This means, for example, that a household located in the area of Jharkhand in 1999 is assigned the state per capita GDP of Bihar, while it is assigned the per capita GDP of Jharkhand in 2005. This is not a minor issue considering that, in 2005, Jharkhand has about 2.5 times the per capita income level of Bihar, Uttaranchal has nearly twice the per capita GDP of Uttar Pradesh, and Chhattisgarh’s per capita GDP is one fourth higher than that of Madhya Pradesh.

3. INDIVIDUAL-LEVEL VARIABLES: MODIFICATIONS FROM B&Z

a. Education of the woman and her partner is measured in years of education instead of the categorical variables chosen by Bhat and Zavier (2003). The choice was made for two reasons: first, because the definitions of categories changed between the two surveys; and second, to simplify the exposition and better focus on the variables of primary interest.

b. In addition to years of education, a literacy variable is included to account for the large number of interviewed women with no education (about 55% in NFHS-2 and 43% in NFHS-3). The variable takes the value of 1 if the woman is illiterate, 0 otherwise. When comparing results across periods, however, it is important to know that definitions of literacy changed between the two surveys. In NFHS-2, a woman who answered no to “can you read and write?” was marked illiterate. In NFHS-3, the woman was asked whether she could read all or part of a sentence from a literacy card; women who could not read at all were coded as illiterate. Unfortunately, there is no way to tell how the two variables compare exactly; in NFHS-2, those who said they could not read or write may have been able to read part of the NFHS-3 sentence, which would inflate illiteracy relative to NFHS-3. On the other hand, there was no question in NFHS-2 to check reading skills and women may have wrongly declared being able to read and write.

c. Age. The squared term for the respondent’s age was dropped. Preliminary tests indicated that the effect of age goes primarily through its correlation with total number of children; given the definition of SP, including the square of the mother’s age did not improve the fit of the model and had no impact on other results.

SUPPLEMENT 2

STABILITY TESTS FOR SMALL SAMPLE PCA

Although principal components analysis (PCA) is widely recognized as a means to create wealth indices from survey data, the analysis is normally performed using large sample sizes. When the comparison base is a community, the number of households interviewed in the same community may be small, even if it constitutes a random sample for that community as is the case here for Primary Sampling Units (PSUs). The literature does not provide information about a minimum sample size necessary to correctly rank households using PCA. In order to evaluate whether the method correctly ranks households in smaller samples, I ran simulation tests using all PSU’s with sample size N≥55 (63 PSUs in NFHS-2 and 99 in NFHS-3). The following procedure was followed:

i. PCA is performed for the full size PSUs as in the main analysis; results are recorded in PCALL for scores and PCQALL for quintiles.

ii. Numbers from 1 to N/5 are randomly assigned to households in each quintile (regardless of how they ranked in score); samples are reduced to n=50, 40, 35, 30, 25, 20, 15, 10, and 5 households by keeping households numbered 1 through n/5 for each value of n.

iii. The same PCA is run for each sample size; scores are recorded in PCn and quintiles in PCQn.

Using this procedure for all PSUs, ten different wealth scores and quintiles based on sample sizes from the full N≥55 down to five households (one per original full-sample quintile) are obtained. Because household selection in each quintile makes a difference in terms of resulting PC scores, the procedure is repeated k=50 times, each time with a new randomization of the ordering of households in each quintile (adding more runs did not change the summary statistics of the correlation coefficients). Principal components scores and quintiles thus obtained are recorded in PCn_k and PCQn_k, k=1, 2, …, 50. Correlation coefficients between PCALL and PCn_k (r^s_nk) and between PCQALL and PCQn_k (r^q_nk), k=1, 2, …, 50, are calculated for all n, based on the five households per PSU who obtained the number 1 in the random ranking in the k^th run, and recorded in the random variables r^s_n and r^q_n.
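The simulation procedure can be sketched as follows for the score correlations (quintile correlations would follow the same pattern). This is a minimal illustration under stated assumptions, not the code used for the paper: each PSU is assumed to be a NumPy array with one row per household and one column per asset indicator, the wealth score is taken to be the first principal component of the standardized indicators, and the sign of the component is fixed so that higher scores go with more assets. All function names are hypothetical.

```python
import numpy as np


def first_pc_scores(X):
    """Scores on the first principal component of standardized columns."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]  # drop indicators with no variation
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ vt[0]
    # The sign of a principal component is arbitrary; orient it so that
    # higher scores go with more assets overall (assumption of this sketch).
    if np.corrcoef(scores, Z.sum(axis=1))[0, 1] < 0:
        scores = -scores
    return scores


def quintiles(scores):
    """Quintile labels 0..4 assigned by rank."""
    ranks = scores.argsort().argsort()
    return ranks * 5 // len(scores)


def one_run(X, sizes, rng):
    """One randomization for one PSU: full vs. reduced-sample scores of the
    five households numbered 1 (one per full-sample quintile)."""
    pc_all = first_pc_scores(X)
    q_all = quintiles(pc_all)
    # Random numbering within each quintile; position 0 is "number 1".
    orders = [rng.permutation(np.flatnonzero(q_all == q)) for q in range(5)]
    firsts = np.array([o[0] for o in orders])
    out = {}
    for n in sizes:
        keep = np.concatenate([o[: n // 5] for o in orders])
        pc_n = first_pc_scores(X[keep])
        pos = np.array([np.flatnonzero(keep == f)[0] for f in firsts])
        out[n] = (pc_all[firsts], pc_n[pos])
    return out


def stability_correlations(psus, sizes=(50, 40, 35, 30, 25, 20, 15, 10, 5),
                           runs=50, seed=0):
    """For each n, one correlation per run, pooled over all PSUs."""
    rng = np.random.default_rng(seed)
    result = {n: [] for n in sizes}
    for _ in range(runs):
        pairs = {n: ([], []) for n in sizes}
        for X in psus:
            for n, (full, reduced) in one_run(X, sizes, rng).items():
                pairs[n][0].extend(full)
                pairs[n][1].extend(reduced)
        for n in sizes:
            result[n].append(np.corrcoef(pairs[n][0], pairs[n][1])[0, 1])
    return result
```

Summary statistics such as the minimum, maximum, mean, and standard deviation of the correlations can then be computed over the list returned for each n.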

Table S7 gives summary statistics of r^s_n and r^q_n by sample size n. Correlations of quintiles are slightly lower than correlations of scores due to threshold effects, but correlations with the full-sample results still average above 0.90 for all n>20 and above 0.80 for n>10. Results are very similar across the two samples. The median correlation is virtually identical to the mean in all cases; Figure S1 represents the gradual change in median correlations between full-sample and reduced-sample scores and quintiles as sample sizes are reduced. The tests reveal that the principal components procedure is relatively robust to the number of households in the sample. Correlations decrease with the size of the sample but there are no obvious breaks, down to sample sizes of 10. The rate at which deterioration occurs increases slightly when sample sizes get below 25 (although the range of y-values chosen emphasizes the magnitude of the deterioration).

Table S7. Summary statistics on correlations between full-sized and reduced-sample principal components scores from 50 independent randomized household selections.

	NFHS-2 (based on 63 PSUs)				NFHS-3 (based on 99 PSUs)
Sample sizes (n)	Min	Max	Mean	St. dev.	Min	Max	Mean	St. dev.
50 (10 per quintile)	.96	1	.99	.01	.96	1	.99	.01
40	.92	.99	.98	.02	.95	.99	.98	.01
35	.92	.99	.98	.02	.92	.99	.97	.01
30	.87	.99	.97	.02	.87	.98	.96	.02
25	.86	.98	.96	.02	.91	.98	.95	.02
20	.85	.97	.94	.02	.86	.97	.93	.02
15	.84	.96	.92	.03	.85	.95	.91	.03
10	.77	.94	.90	.03	.80	.93	.89	.03
5	.78	.91	.85	.03	.76	.89	.83	.03

Note: Correlations for different size samples are calculated using the 5 observations per PSU with calculated principal components scores at all levels, so 315 observations in NFHS-2 and 495 in NFHS-3 are used to calculate the correlations. The number of observations is the same across the 50 runs but the households are different.

Fig. S1. Median correlations between reduced- and full-sample principal components results

SUPPLEMENT 3

ALTERNATIVE LOCAL AREA GROUPINGS FOR NFHS-2

Although not available in NFHS-3, the village/town, tehsil, and district of residence are recorded for each household in NFHS-2. Instead of using the PSU as local base to calculate relative wealth scores, households in the NFHS-2 sample could be grouped into “local areas” based on these geographical identifiers (the grouping was not possible with PSUs because PSUs were numbered irrespective of location). Choosing groupings by area rather than PSU allowed larger sample sizes and more variation in the size of comparison groups in line with population densities. Samples ranged from 30 (by construction) to 2,435. The largest samples represented areas identified as a single village/town. The less densely populated an area, the more likely grouping involved full tehsils or districts. This may be good to evaluate relative wealth if households in low density areas tend to position themselves relative to larger geographical areas. One drawback of grouping by geographical identifiers, however, is that the resulting samples are no longer statistically representative of the area.
The procedure to create local areas required intensive, detailed manipulation of the data set. The guiding principle was to find the closest grouping with at least 30 households. In a few cases, this implied grouping two districts together, but in most cases groupings remained within one district and, in more than half the data, within a single village/town. Figure S2 represents the algorithm used to find the smallest geographical area with samples of at least 30 households. Single villages/towns, tehsils, and districts are denoted respectively as Li, Ti, and Di, i being the identifier corresponding to the local residence of the household considered. Neighboring villages/towns, tehsils, and districts could be identified if they were numbered consecutively.

Fig. S2. Grouping of households into local areas based on NFHS-2 geographical identifiers
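The principle of widening the comparison area until at least 30 households are found can be sketched as below. This is a simplified illustration, not the algorithm of Figure S2: it only considers the household's own village/town (L1), tehsil (T1), and district (D1), and it leaves out the joined levels (L2 to L5, T2, D2), which depend on how neighboring units were numbered. The function name and the tuple layout of the identifiers are assumptions.

```python
from collections import Counter

MIN_HOUSEHOLDS = 30  # minimum sample size stated in the text


def assign_local_areas(households):
    """households: list of (district, tehsil, village) identifiers, one per household.

    Returns one (level, key) pair per household, where level is "L1" (own
    village/town), "T1" (tehsil), or "D1" (district). Joined levels are omitted.
    """
    by_village = Counter(households)
    by_tehsil = Counter((d, t) for d, t, _ in households)
    by_district = Counter(d for d, _, _ in households)
    areas = []
    for d, t, v in households:
        if by_village[(d, t, v)] >= MIN_HOUSEHOLDS:
            areas.append(("L1", (d, t, v)))
        elif by_tehsil[(d, t)] >= MIN_HOUSEHOLDS:
            areas.append(("T1", (d, t)))
        elif by_district[d] >= MIN_HOUSEHOLDS:
            areas.append(("D1", (d,)))
        else:
            areas.append((None, None))  # would require joining districts
    return areas
```

In this sketch, a household in a village with fewer than 30 sampled households is compared to its whole tehsil, including households from larger villages in that tehsil; whether the original procedure treated such cases the same way is not specified here.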

Table S8 indicates the extent to which local areas were aggregated to find the local base for each household. About half of the households were compared to their own village/town area without need for aggregation and 60% were compared to geographical areas smaller than the tehsil. For 10% of households, district-level comparisons were necessary. The average number of households in the local area samples used for PCA is 206, the median is 52, with a minimum of 30 (by construction) and a maximum of 2,435.

Table S8. Constructed local areas from household level NFHS-2 data

			Number of households in area used as base for PCA
Lowest level retained for PCA	Total	%	Min	Max	Mean
Village/town (L1)	50,333	54.42	30	2,435	321
Joined villages/towns (L2 to L5)	6,384	6.9	30	88	48
Tehsil (T1)	16,135	17.45	30	356	64
Joined tehsils (T2)	10,347	11.19	36	178	71
District (D1)	6,956	7.52	30	220	85
Joined districts (D2)	2,331	2.52	31	206	87

Principal components scores and quintiles were calculated using these comparison groups. Table S9 gives correlations between the relative wealth scores calculated at the PSU level and those calculated using local identifiers for different PSU sizes. Correlations are much lower for small PSUs because the smaller the PSU sample, the more likely it is that a much wider geographical base (such as the district) was used.

Table S9. Correlations between PSU and local area principal components scores by PSU sample size

Sample size in PSU	No. of PSUs	No. of households	Correlation coeff.
All sizes	3,215	92,486	0.87
≥50	112	6,181	0.94
40-49	232	10,086	0.93
35-39	312	11,462	0.93
30-34	637	20,208	0.94
25-29	868	23,487	0.81
20-24	650	14,503	0.79
15-19	316	5,429	0.76
10-14	85	1,106	0.66
<10	3	24	not significant

SUPPLEMENT 4

ADDITIONAL RESULTS

1. NFHS-3 LINEAR MULTILEVEL ESTIMATION RESULTS

The article includes tables for NFHS-2 and pooled samples. Full estimation results for NFHS-3 are given below (Table S10).

Table S10. Multilevel linear estimation of son preference models: NFHS-3

	Model
Independent variable	MW1	MWR1	MWR2	MWR3
State level:
GSP/c	-0.00039*	-0.00039*	-0.00037*	-0.00036*
	(0.068)	(0.069)	(0.085)	(0.088)
Household level:
W	-0.049***	-0.050***	-0.068***	-0.064***
WR			0.014**	0.0039
			(0.005)	(0.531)
WR × No land				0.081
				(0.152)
WR × Land				0.016***
				(0.001)
Land acres × Urban (×100)		-0.017	-0.018	-0.024
		(0.323)	(0.293)	(0.168)
Land acres× Rural (×100)		0.0096	0.0042	0.0018
		(0.569)	(0.803)	(0.916)
Individual level
Illiterate	0.0034	0.0035	0.0037	0.0038
	(0.278)	(0.263)	(0.237)	(0.225)
Education, self	-0.0019***	-0.0018***	-0.0018***	-0.0018***
Education, partner	-0.00035	-0.00032	-0.00035	-0.00039
	(0.148)	(0.185)	(0.152)	(0.114)
Paid work	-0.0055***	-0.0055***	-0.0052**	-0.0051**
	(0.008)	(0.008)	(0.012)	(0.013)
Other work	0.010***	0.0098***	0.0093***	0.0090**
	(0.004)	(0.006)	(0.008)	(0.011)
Media exposure	-0.0069***	-0.0068***	-0.0069***	-0.0070***
	(0.003)	(0.001)	(0.003)	(0.003)
Religion: Ref. Hindu
Muslim	-0.0046	-0.0044	-0.0043	-0.004
	(0.148)	(0.164)	(0.175)	(0.207)
Sikh	0.012	0.011	0.012	0.011
	(0.144)	(0.154)	(0.151)	(0.172)
Christian	-0.030***	-0.030***	-0.030***	-0.030***
Other	-0.0028	-0.0028	-0.0024	-0.0023
	(0.614)	(0.622)	(0.665)	(0.680)
Scheduled Caste	0.0021	0.0021	0.0023	0.0026
	(0.401)	(0.397)	(0.357)	(0.301)
Scheduled Tribe	-0.0082**	-0.0087**	-0.0091**	-0.0092**
	(0.022)	(0.015)	(0.011)	(0.011)
Age (respondent)	0.000069	0.000083	0.000088	0.000089
	(0.594)	(0.521)	(0.497)	(0.491)
Sons	0.025***	0.025***	0.025***	0.025***
Daughters	-0.014***	-0.014***	-0.014***	-0.014***
Sons- dead	0.0062***	0.0062***	0.0062***	0.0062***
	(0.002)	(0.002)	(0.002)	(0.002)
Daughters- dead	-0.0021	-0.0022	-0.0022	-0.0022
	(0.336)	(0.309)	(0.308)	(0.309)
Ideal- total	0.0026	0.0029	0.0027	0.0027
	(0.416)	(0.367)	(0.386)	(0.391)
Ideal- total squared	-0.0011**	-0.0011**	-0.0011**	-0.0011**
	(0.015)	(0.012)	(0.012)	(0.013)
Odd ideal	0.17***	0.17***	0.17***	0.17***
Fixed Effects
Region: Ref. East
North	0.029***	0.029***	0.029***	0.029***
	(0.004)	(0.004)	(0.004)	(0.004)
Central & West	0.027**	0.027**	0.027**	0.027**
	(0.025)	(0.025)	(0.025)	(0.025)
South	-0.035***	-0.035***	-0.035***	-0.034***
	(0.004)	(0.003)	(0.003)	(0.004)
Urban Residence	-0.012***	-0.012***	-0.0097***	-0.0087***
			(0.001)	(0.002)
Constant	0.073***	0.073***	0.072***	0.071***
Random components (standard deviations)
Level 1: State	0.020***	0.020***	0.020***	0.018***
Level 2: PSU	0.037***	0.037***	0.037***	0.037***
Level 3: Household	0.022***	0.020***	0.020***	0.020***
Residual error	0.25***	0.25***	0.25***	0.25***
Regression Statistics
Akaike Information Criterion	2,966	2,937	2,931	2,929
Nested groups (unbalanced)
States	29	29	29	29
Local areas (PSUs)	3,722	3,722	3,722	3,722
Households	75,343	75,343	75,343	75,343
N	83,785	83,785	83,785	83,785

Note: p-values in parentheses (p>|z|), omitted when p<.001.

*p<0.10 ** p<0.05 ***p<0.01

2. COEFFICIENTS ON IDEAL CHILDREN AND ITS SQUARE: DISCUSSION

Coefficients on ideal-total are negative and significant in NFHS-2 and the pooled sample but insignificant when using NFHS-3 alone. Coefficients on the squared term are positive and significant in the reported results but significantly negative in the NFHS-3 results (reported above, Table S10). Alternative estimations with logit and ordered logit yielded significant results in line with B&Z (see section 3 below). However, B&Z’s continuous variable, also proportional to ideal family size, yielded the same direction of effect with OLS as with logit. It appeared that results on ideal-total were very sensitive to the specification of the dependent variable and the data set used. This deserved further inquiry.

To understand the discrepancy, I estimated the model with alternative dependent variables. When the dependent variable was calculated as ideal-boys divided by ideal-total, as in B&Z, the signs obtained with OLS were found to coincide with those of the logit estimations, i.e., a positive sign on ideal-total and a negative sign on the squared term. A small modification to the variable, however, inverted the signs (although, importantly for this article, signs and significance levels on other variables were unchanged). The modification concerned the cases in which women gave the same answer to the ideal number of either sex as to the ideal number of children, for which the value of the dependent variable was 1/2 in B&Z, instead of zero here. Note that these responses are not less son-preferring or more girl-preferring than responses with equal ideal numbers of boys and girls in the ideal family, so a score of zero makes sense.

3. COMPARISON OF OLS, LOGIT, AND ORDERED LOGIT

Two alternative dependent variables measuring son preference are constructed, resembling those used in the literature. The first one is a binary variable that takes the value of 1 when a woman indicated more sons than daughters in her ideal family composition. The second variable is an ordered categorical variable constructed by grouping the continuous variable into three categories according to the difference between ideal number of boys and ideal number of girls (D): it takes the value of 0 if D≤0, 1 if D=1, and 2 if D≥2. The estimation is done without taking account of the hierarchical structure of the data, but standard errors do take account of the complex survey design, with clustering at the PSU level and strata coinciding with rural and urban areas of states. Results for the pooled sample are presented in the article. This supplement gives additional tables for the NFHS-2 (Table S11) and NFHS-3 (Table S12) separate results. Coefficient estimates for the logit and ordered logit regressions are reported as odds ratios. This is useful in that it gives a better idea of the relative magnitude of different effects; it is also easier to compare with B&Z’s logit results. To compare with OLS estimates, however, one must remember that values less than one correspond to negative signs in the linear regression. Results of the logit and ordered logit models give the same significance level and direction of effect for all the variables pertaining to the hypothesis of this paper.
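The two alternative dependent variables and the odds-ratio reporting convention can be illustrated with the short sketch below. The function names are hypothetical; D denotes the difference between ideal number of boys and ideal number of girls, as in the text.

```python
import math


def binary_sp(d):
    """1 if more sons than daughters in the ideal family (D > 0), else 0."""
    return 1 if d > 0 else 0


def ordered_sp(d):
    """Three ordered categories: 0 if D <= 0, 1 if D == 1, 2 if D >= 2."""
    if d <= 0:
        return 0
    return 1 if d == 1 else 2


def odds_ratio(logit_coefficient):
    """Odds ratio implied by a logit coefficient: exp(coefficient).

    A negative coefficient gives an odds ratio below 1, a positive one an
    odds ratio above 1, which is how the tables should be read against OLS.
    """
    return math.exp(logit_coefficient)
```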

Table S11. Comparison of OLS, logit, and ordered logit specifications: NFHS-2

	Estimation method^a
Independent Variables	OLS	Logit^b	Ordered logit^b
GSP/c	-0.0016***	0.98***	0.98***
W	-0.047***	0.53***	0.65***
WR (PSU)	0.00275	1.03	0.996
	(0.718)	(0.799)	(0.955)
WR×Land	0.015***	1.28***	1.19***
	(0.006)	(0.001)
Land acres×urban (×100)	-0.00086	0.62	0.81
	(0.943)	(0.163)	(0.249)
Land acres×rural (×100)	0.036***	1.58***	1.28***
		(0.008)	(0.0008)
Illiterate	0.0081*	1.02	1.02
	(0.058)	(0.717)	(0.722)
Education, self	-0.0028***	0.95***	0.97***
Education, partner	-0.00022	1.00	0.998
	(0.456)	(0.611)	(0.518)
Paid work	-0.0073**	1.005	1.01
	(0.016)	(0.899)	(0.712)
Other work	0.0057	1.07	1.05***
	(0.115)	(0.127)	(0.161)
Media Exposure	-0.014***	0.87***	0.90***
Religion: Ref. Hindu
Muslim	-0.015***	0.74***	0.85***
Sikh	0.034***	1.43***	1.31***
Christian	-0.0067	0.73***	0.78***
	(0.334)	(0.003)	(0.002)
Other	-0.0170	0.85	0.88
	(0.268)	(0.195)	(0.162)
Scheduled caste	0.0034	1.03	1.02
	(0.263)	(0.541)	(0.491)
Scheduled tribe	-0.022***	0.72***	0.80***
Age (respondent)	-0.00034**	0.989***	0.995***
	(0.034)
Sons	0.025***	1.37***	1.26***
Daughters	-0.015***	0.84***	0.87***
Sons, dead	0.010***	1.11***	1.08***
Daughters, dead	-0.0027	0.99	0.98
	(0.214)	(0.716)	(0.364)
Ideal-total	-0.0015**	2.91***	2.25***
	(0.023)
Ideal-total squared	0.0015**	0.91***	0.95***
	(0.057)
Odd ideal	0.18***	37.89***	13.51***
Region: Ref. East
North	0.047***	1.79***	1.49***
West	0.030***	1.36***	1.25***
South	-0.034***	0.43***	0.56***
Urban Residence	-0.0066*	0.95	0.98**
	(0.081)	(0.333)	(0.546)
Constant	0.142***	0.026***
Constant 0-1			28***
Constant 1-2			390***
Regression Statistics
N	77,886	77,886	77,886
F	554	406	375
R-Squared	0.18	--	--

Note: p-values (p>|z|) in parentheses below the coefficient estimate, omitted when p<.001.

^a All standard errors corrected for group heteroskedasticity caused by the NFHS complex survey design. Strata are rural/urban areas of each state in each NFHS sample. Household effects are ignored.

^b Odds ratios reported; numbers <1 indicate negative relationships.

* p<0.1; ** p<0.05; *** p<0.01

Table S12. Comparison of OLS, logit, and ordered logit specifications: NFHS-3

	Estimation method^a
Independent Variables	OLS	Logit^b	Ordered logit^b
GSP/c	-0.00028***	0.99***	0.996***
	(0.003)
W	-0.026**	0.65**	0.76**
	(0.032)	(0.022)	(0.044)
WR (PSU)	0.0013	1.06	1.003
	(0.870)	(0.654)	(0.973)
WR×Land	0.013**	1.20**	1.18***
	(0.014)	(0.028)	(0.007)
Land acres×urban (×100)	-0.042	0.66	0.68
	(0.114)	(0.389)	(0.260)
Land acres×rural (×100)	0.033	1.31	1.31
	(0.167)	(0.513)	(0.293)
Illiterate	-0.0020	0.95	0.95
	(0.618)	(0.367)	(0.206)
Education, self	-0.0031***	0.95***	0.96***
Education, partner	-0.00062**	0.99	0.994**
	(0.037)	(0.263)	(0.048)
Paid work	-0.0027	0.97	0.99
	(0.334)	(0.496)	(0.692)
Other work	0.016***	1.19**	1.15***
	(0.001)	(0.01)	(0.004)
Media exposure	-0.0095***	0.89***	0.92***
		(0.003)	(0.003)
Religion: Ref. Hindu
Muslim	-0.010***	0.75***	0.86***
	(0.004)
Sikh	0.033***	1.43***	1.36***
		(0.004)	(0.002)
Christian	-0.025***	0.57***	0.63***
Other	-0.024***	0.90	0.86
	(0.006)	(0.440)	(0.142)
Scheduled caste	0.0023	1.05	1.05
	(0.447)	(0.283)	(0.165)
Scheduled tribe	-0.0050	0.88*	0.93
	(0.313)	(0.077)	(0.142)
Age (respondent)	-0.000067	0.996	0.998
	(0.67)	(0.105)	(0.255)
Sons	0.022***	1.35***	1.24***
Daughters	-0.012***	0.84***	0.87***
Sons, dead	0.0041*	1.05	1.04
	(0.054)	(0.106)	(0.103)
Daughters, dead	-0.000034	0.998	0.998
	(0.989)	(0.963)	(0.916)
Ideal-total	-0.00089	2.57***	2.35***
	(0.905)
Ideal-total squared	0.000022	0.89***	0.95***
	(0.982)
Odd ideal	0.19***	42.4***	21.52***
Region: Ref. East
North	0.0078**	1.14**	1.07*
	(0.043)	(0.022)	(0.077)
West	0.011***	1.24***	1.16***
	(0.009)		(0.001)
South	-0.044***	0.40***	0.48***
Urban Residence	-0.0074**	0.86	0.90**
	(0.032)	(0.006)	(0.012)
Constant	0.082***	0.011***
Constant 0-1			51***
Constant 1-2			954***
Regression Statistics
N	83,785	83,785	83,785
F	421	344	300
R-Squared	0.17	--	--

Note: p-values (p>|z|) are in parentheses below the coefficient estimate, they are omitted when p<.001.

^a All standard errors corrected for group heteroskedacticity caused by the NFHS complex survey design. Strata are rural/urban areas of each state in each NFHS-sample. Household effects are ignored.

^b Odds ratios reported; numbers <1 indicate negative relationships.

* p<0.1; ** p<0.05; *** p<0.01

4. RANDOM EFFECTS LOGIT ESTIMATION WITH STATE DUMMIES.

The estimation treats household effects as random and state effects as fixed. Standard errors are not corrected for clustering at the PSU level. This is not the ideal multilevel procedure (multilevel logit estimation was not feasible for such a large data set and such a complex model structure given our computing resources at the time), but it indicates whether the multilevel results in the text suffer from the less-than-ideal distribution of the dependent variable.

Here the dependent variable is dichotomous: a woman either declared a preference for more sons or did not. Results of the analysis are reported in Table S13. As above, results are reported as odds ratios so that the magnitude of effects can be easily compared. An estimated odds ratio below 1 indicates a negative relationship, while an odds ratio above 1 indicates a positive relationship. For b_o < 1, the lower b_o, the greater the effect; the opposite is true for b_o > 1, although the relative probabilities cannot be compared directly.
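A small numerical sketch may help with this reading of odds ratios. It takes two values from Table S13 (model MW-1) and compares an effect below 1 with an effect above 1 on the log-odds scale, which is one common way to put the two on the same footing. The function names are illustrative only and are not part of the original estimation.

```python
import math

def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in the odds for a one-unit increase in the regressor."""
    return (odds_ratio - 1.0) * 100.0

def strength_on_log_scale(odds_ratio: float) -> float:
    """Absolute log odds ratio: symmetric around OR = 1 (0.5 and 2.0 score the same)."""
    return abs(math.log(odds_ratio))

# Odds ratios copied from Table S13, model MW-1.
w_or = 0.464      # absolute wealth (W)
sikh_or = 1.35    # Sikh, relative to Hindu

print(f"W:    {percent_change_in_odds(w_or):+.1f}% odds, |log OR| = {strength_on_log_scale(w_or):.3f}")
print(f"Sikh: {percent_change_in_odds(sikh_or):+.1f}% odds, |log OR| = {strength_on_log_scale(sikh_or):.3f}")
```

On this scale, an odds ratio of 0.5 and one of 2.0 are equally strong effects in opposite directions, which is why values below and above 1 should not be compared by their raw distance from 1.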

All variables of interest (wealth-related) keep the same direction of effect as in the multilevel linear estimation and reach higher significance levels. In wealthier households (in absolute terms), the odds of being son-preferring are found to be less than one half the odds of expressing no son preference, on average. The impact of absolute wealth in reducing the odds of son preference is found to be much larger at the household level than for state wealth. The strength of the relationship increases by 4% when only land ownership is controlled for; it increases by 22% when relative wealth and land ownership are both in the estimation. The Akaike information criteria reveal the same pattern as the multilevel linear estimation.
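The 4% and 22% figures can be reproduced from the W odds ratios in Table S13 under one plausible reading, assumed here rather than stated in the text: strength is measured as the reduction in odds, 1 − OR, and each model is compared with MW-1. The helper name `strengthening` is hypothetical.

```python
# Odds ratios on W copied from Table S13.
w_odds_ratio = {"MW-1": 0.464, "MWR-1": 0.443, "MWR-2": 0.345, "MWR-3": 0.375}

def strengthening(model: str, baseline: str = "MW-1") -> float:
    """Percent increase in the odds reduction (1 - OR) relative to the baseline model.

    Assumption: 'strength' means the reduction in odds, 1 - OR.
    """
    base = 1.0 - w_odds_ratio[baseline]
    new = 1.0 - w_odds_ratio[model]
    return (new / base - 1.0) * 100.0

for model in ("MWR-1", "MWR-2", "MWR-3"):
    print(f"{model}: {strengthening(model):.1f}% stronger than MW-1")
```

Under this reading, MWR-1 (land ownership added) rounds to 4% and MWR-2 (relative wealth and land ownership) rounds to 22%.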

Table S13. Son preference models: mixed-effects logit estimation for NFHS-2

	Model
Independent variable	MW-1	MWR-1	MWR-2	MWR-3
GSP/c	0.984***	0.984***	0.985***	0.986***
W	0.464***	0.443***	0.345***	0.375***
WR			1.22***	1.03
			(0.003)	(0.681)
WR×Land				1.23***
				(0.001)
Land acres × Urban (×100)		1.0002	1.0002	1.0001
		(0.351)	(0.377)	(0.789)
Land acres × Rural (×100)		1.0006***	1.0005***	1.0005***

Illiterate	0.969	0.972	0.977	0.982
	(0.533)	(0.568)	(0.646)	(0.713)
Education, self	0.956***	0.957***	0.958***	0.959***
Education, partner	0.996	0.996	0.995	0.994*
	(0.267)	(0.220)	(0.163)	(0.098)
Paid work	0.956	0.962	0.967	0.969
	(0.174)	(0.23)	(0.307)	(0.337)
Other work	1.09**	1.07*	1.07*	1.06
	(0.030)	(0.059)	(0.077)	(0.134)
Media exposure	0.944*	0.945*	0.943*	0.939*
	(0.082)	(0.089)	(0.076)	(0.057)
Religion: Ref. Hindu
Muslim	0.814***	0.820***	0.821***	0.829***
Sikh	1.35***	1.35***	1.35***	1.33***
	(0.004)	(0.004)	(0.005)	(0.007)
Christian	0.649***	0.649***	0.648***	0.647***
Other	0.959	0.958	0.962	0.965
	(0.646)	(0.641)	(0.675)	(0.697)
Scheduled caste	0.979	0.987	0.992	1.0015
	(0.557)	(0.715)	(0.822)	(0.966)
Scheduled tribe	0.767***	0.770***	0.764***	0.766***
Age (respondent)	0.990***	0.990***	0.990***	0.990***
Sons	1.41***	1.41***	1.41***	1.41***
Daughters	0.826***	0.825***	0.825***	0.825***
Sons, dead	1.13***	1.13***	1.13***	1.13***
Daughters, dead	0.973	0.973	0.973	0.973
	(0.279)	(0.281)	(0.280)	(0.276)
Ideal-total	2.98***	2.98***	2.98***	2.97***
Ideal-total squared	0.911***	0.911***	0.911***	0.912***
Odd ideal	60.2***	60.3***	60.4***	60.2***
Fixed Effects
Urban Residence	0.859***	0.883***	0.906**	0.940
			(0.011)	(0.121)
Constant	0.039***	0.038***	0.037***	0.037***
State effects omitted (25)
Random Component (log variance)
Household effect	0.594**	0.595**	0.595**	0.590**
	(0.011)	(0.011)	(0.011)	(0.010)
Regression Statistics
LL	-25,318	-25,308	-25,303	-25,297
Akaike Information Criterion	50,735	50,718	50,711	50,701
N	77,886	77,886	77,886	77,886

Note: Coefficients reported as odds ratios; p-values in parentheses (p>|z|), omitted when p<0.001.

*p<.10 ** p<.05 ***p<.01

5. EDUCATIONAL PREFERENCE MODEL

Following the last question on ideal family size, the NFHS-2 questionnaire included the following questions:

“In your opinion, how much education should be given to girls these days?” followed by

“In your opinion, how much education should be given to boys these days?”

Answers to these questions were used to construct an alternative dependent variable measuring educational bias. Table S14 presents the raw distribution of answers.

Table S14. Raw distribution of answers on educational preferences

	Frequency		Percent
Answer	Girls	Boys	Girls	Boys
No education	830	163	0.99	0.19
Less than primary	660	123	0.78	0.15
Primary	4,264	741	5.06	0.88
Middle	6,461	1,983	7.67	2.35
High school	15,626	8,069	18.55	9.58
Higher secondary	7,320	7,564	8.69	8.98
Graduate and above	6,930	9,245	8.23	10.97
Professional degree	3,605	6,425	4.28	7.63
As much as he/she desires	29,601	39,194	35.13	46.52
Depends	7,269	9,366	8.63	11.12
Don't know	1,686	1,379	2.00	1.64
Total	84,252	84,252	100	100

Answers were converted into approximate years of education (Y), up to Y=12 for higher secondary; Y=14 was used for anything above secondary. Answers of “as much as he/she desires” were also given a value of 14. All answers that were exactly the same for boys and girls (including “Don’t know” and “Depends”) were coded as 0 bias; other “Don’t know” and “Depends” answers were dropped. The educational bias variable was calculated as

where subscript b is for boys and g is for girls. The variable is highly skewed toward more education for boys, with fewer than 1% of the responses indicating higher education for girls (Figure S3). The mean and standard deviation of EduBias are 0.09 and 0.18, respectively; the median and mode are zero.
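The coding rules described above can be sketched as follows. Only Y=12 for higher secondary and Y=14 for everything above it come from the text; the years assigned to the lower categories are assumptions made for illustration, and the function name `year_gap` is hypothetical. The sketch stops at the raw gap in years, Y_b − Y_g, and does not attempt the scaling used for EduBias.

```python
from typing import Optional

# Approximate years of education per answer category.
YEARS = {
    "No education": 0,           # assumed
    "Less than primary": 3,      # assumed
    "Primary": 5,                # assumed
    "Middle": 8,                 # assumed
    "High school": 10,           # assumed
    "Higher secondary": 12,      # from the text
    "Graduate and above": 14,    # from the text (above secondary)
    "Professional degree": 14,   # from the text (above secondary)
    "As much as he/she desires": 14,  # from the text
}

def year_gap(girls_answer: str, boys_answer: str) -> Optional[int]:
    """Raw gap in years (boys minus girls), or None if the response is dropped."""
    # Identical answers, including "Don't know" and "Depends", count as zero bias.
    if girls_answer == boys_answer:
        return 0
    y_g = YEARS.get(girls_answer)
    y_b = YEARS.get(boys_answer)
    # Any remaining "Don't know" or "Depends" answer has no year value: drop it.
    if y_g is None or y_b is None:
        return None
    return y_b - y_g

print(year_gap("High school", "Graduate and above"))  # boys favored by 4 years
print(year_gap("Depends", "Depends"))                  # zero bias
print(year_gap("Don't know", "Primary"))               # dropped
```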

Figure S3. Distribution of the educational bias variable (NFHS-2)

Although the correlation between EduBias and SP is not as high as one would wish for an alternative dependent variable (r=.12), EduBias is likely to capture a large part of the gender bias expressed in son preference. The model is run using the same linear multilevel method as for the SP model. Elasticities for the variables of interest are compared to SP elasticities in the article. Table S15 reports full results on coefficients and regression statistics.

Table S15. Multilevel linear estimation of stated educational bias: NFHS-2 sample

	Model
Level, variables, statistics	MW-1	MWR-1	MWR-2	MWR-3
State level:
GSP/c	-0.00184	-0.00184	-0.00181	-0.00184
(p>|z|)	(0.012)	(0.012)	(0.013)	(0.011)
Household level:
W	-0.109	-0.109	-0.12	-0.124
WR			0.00758
			(0.063)
WR×No land (×100)				0.0149
				(0.002)
WR×Land (×100)				0.00643
				(0.117)
Land acres × Urban		0.0135	0.0135	0.0185
		(0.253)	(0.253)	(0.121)
Land acres× Rural		-0.0029	-0.0043	-0.0011
		(0.682)	(0.542)	(0.877)
Individual level
Illiterate	0.0163	0.0163	0.0164	0.0163
Education, self	-0.00088	-0.000875	-0.000849	-0.000881
	(0.002)	(0.002)	(0.003)	(0.002)
Education, partner	-0.00227	-0.00228	-0.00229	-0.00226
Paid work	0.0079	0.00789	0.00802	0.00793
Other work	0.0130	0.0131	0.0129	0.0133
Media exposure	-0.00795	-0.00794	-0.00805	-0.00791
Religion: Ref. Hindu
Muslim	0.0102	0.0102	0.0103	0.00993
Sikh	-0.00186	-0.00186	-0.00195	-0.0013
	(0.734)	(0.735)	(0.722)	(0.812)
Christian	-0.00867	-0.00866	-0.00864	-0.00865
	(0.027)	(0.028)	(0.028)	(0.028)
Other	-0.0157	-0.0156	-0.0155	-0.0156
	(0.001)	(0.001)	(0.001)	(0.001)
Scheduled caste	0.0062	0.00618	0.00634	0.00597
	(0.001)	(0.001)	(0.001)	(0.001)
Scheduled tribe	0.00836	0.00834	0.00818	0.00809
	(0.002)	(0.002)	(0.002)	(0.002)
Age (respondent)	-0.00028	-0.000276	-0.000274	-0.000276
	(0.001)	(0.002)	(0.002)	(0.002)
Sons	0.00348	0.00348	0.00346	0.00347
Daughters	0.00625	0.00625	0.00624	0.00624
Fixed Effects
Region (Ref. East)
North	0.0537	0.0538	0.0542	0.0545
	(0.001)	(0.001)	(0.001)	(0.001)
Central & West	0.0755	0.0755	0.0755	0.0756
South	0.0192	0.0193	0.0194	0.0192
	(0.275)	(0.275)	(0.270)	(0.270)
Urban Residence	-0.0169	-0.0172	-0.0158	-0.0173
Constant	0.14	0.14	0.14	0.141
Random Components (Standard Deviations)
Level 1: State	0.029***	0.029***	0.029***	0.029***
Level 2: PSU	0.043***	0.043***	0.043***	0.043***
Level 3: Household	0.071***	0.071***	0.071***	0.071***
Residual error	0.141***	0.141***	0.141***	0.141***
Regression Statistics
Akaike Information Criterion	-60443	-60440	-60642	-60448
Nested groups (unbalanced)
States	26	26	26	26
Local areas (PSUs)	3,127	3,127	3,127	3,127
Households	65,123	65,123	65,123	65,123
N	74,168	74,168	74,168	74,168

Note: p-values in parentheses (p>|z|), omitted when p<.001.

*p<.1 ** p<.05 ***p<.01

SUPPLEMENT 5

GEOGRAPHICAL REACH OF MARRIAGES

The arguments linking son-preference to relative wealth via issues related to traditions of marriage depend greatly on the geographical reach of these marriages. In a world of perfect mobility with geographically unlimited marriage searches, there should be no effect of relative wealth, at least explained through issues of marriage, as the effect would be confounded with that of absolute wealth. Although the empirical analysis finds that relative wealth does have a significant effect, the role of each theoretical argument in explaining the magnitude of the relative wealth effect cannot be identified using the model developed here (possibly providing some direction for future research).

a. Evidence from anthropological and sociological literatures. What do we know about the geographical reach of marriages in India from other research? I found only scattered evidence spanning about 30 years (and I welcome additional information on this point), but all of it points to a relatively small geographical reach for marriages (several neighboring villages within about 10-25 miles), although socioeconomic factors do seem to influence that distance. I present the evidence in chronological order.

Klass (1966) provides anthropological evidence on marriage in West Bengal where, as opposed to other parts of India, there are no specific exogamous or endogamous rules. He points out that richer families can choose villages further away because they are less constrained monetarily, but the village cannot be too far away, as there must be a trustworthy “Goodman”, preferably a kin, who can recommend the boy and the boy’s family.

“[...] while the villager considers himself free to choose any locality, comparatively few marriages are contracted within the same village or between families of very distant villages. The majority of marriages are arranged with families of villages other than one's own, but within a radius of about five to ten miles.” (Klass 1966: 961)

Dutt, Noble, and Davgun (1981) report a radius of 25 miles for 80% of marriages in two village communities in Punjab (North India). They also find that marriage distance is affected by socioeconomic factors and by development in general (with marriage distance increasing with economic development).

Rosenzweig and Stark (1989) show that distant marriages are not a feature of richer families but a way to diversify risk. To test their hypothesis, they use a panel of six farm villages in three different agroclimatic regions in the semi-arid tropics of India. In their sample, the mean distance between village of origin and village where daughters marry is 30 kilometers (approximately 20 miles).

Babu and Naidu (1992) report mean marriage distances in three endogamous caste populations in Andhra Pradesh ranging between 17 and 40 kilometers (approximately 10 to 25 miles).

From rules of village and kin exogamy/endogamy, one could deduce that marriage distance must be lower in areas with endogamous rules (mostly the South) and greater in areas with exogamous rules (North). However, Dalmia and Lawrence (2005) find no difference in average distance of marriage migration between the two regions.

b. Dynamics. As mentioned above, marriage rules in India are not specific about geographical distance. The role of status in promoting son preference could therefore weaken if it becomes easier to contract marriages further away, as would normally happen when economic development brings better transportation and communication infrastructure. The extent to which families desire to contract marriages further away would also be likely to change. In addition, economic development could alter the role of families in marriages. Barber (2004) shows that, as non-family institutions such as schools, employment opportunities, and infrastructure develop, individualistic attitudes toward marriage become prevalent. The empirical evidence is from the United States and cannot be directly applied to the case of India, where the role of families in marriages is historically much stronger, but it would be consistent with a decrease in the role of relative status in son-preferring attitudes.

References

Babu, B. V. and J. M. Naidu. 1992. “Marriage distance among four caste population in Andhra Pradesh.” Man in India 71(1): 77-80.

Barber, Jennifer S. 2004. “Community Social Context and Individualistic Attitudes toward Marriage.” Social Psychology Quarterly 67(3): 236-256.

Dutt, A. K., A. G. Noble, and S. K. Davgun. 1981. “Socio-Economic Factors Affecting Marriage Distance in Two Sikh Villages of Punjab.” Journal of Cultural Geography 2(1): 13-26.

Klass, Morton. 1966. “Marriage Rules in Bengal.” American Anthropologist, New Series 68(4): 951-970.

Rosenzweig, Mark R. and Oded Stark. 1989. “Consumption Smoothing, Migration, and Marriage: Evidence from Rural India.” The Journal of Political Economy 97(4): 905-926.