Monday, February 28, 2011

Statistics 4: Hypothesis tests between groups (categorical variable and numerical variable)

I. INDEPENDENT-SAMPLES T TEST

The independent-samples t test examines whether the distribution of a variable (or several variables) is the same in two groups. (Unstated assumption: membership in the group is the independent variable.)

Condition: two independent samples each take one test only.


t(58) = -1.14, p > 0.05

1. df (degrees of freedom) = n1 + n2 - 2

2. p value

3. one-tailed or two-tailed test

Effect size (ES) is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity. An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as p-values. In a certain sense, this means the p-value tells you the probability of making the wrong decision in rejecting or accepting the null hypothesis, while the effect size or correlation coefficient tells you the strength of the relationship between the variables.
Cohen's d: (mean of X1 - mean of X2) / pooled SD
http://www.uccs.edu/~faculty/lbecker/
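To make this concrete, here is a minimal Python sketch of the whole section (scipy assumed; the two groups of 30 invented scores give df = n1 + n2 - 2 = 58, as in the example above). This is only an illustration outside SPSS, not the procedure from my course.

```python
# Minimal sketch: independent-samples t test plus Cohen's d.
# The data are invented; scipy.stats.ttest_ind is two-tailed by default.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(loc=50, scale=10, size=30)  # invented scores, group 1
group2 = rng.normal(loc=53, scale=10, size=30)  # invented scores, group 2

t, p = stats.ttest_ind(group1, group2)
df = len(group1) + len(group2) - 2              # df = n1 + n2 - 2

# Cohen's d with the pooled standard deviation
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) +
                     (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))
d = (group1.mean() - group2.mean()) / pooled_sd

print(f"t({df}) = {t:.2f}, p = {p:.3f}, Cohen's d = {d:.2f}")
```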

II. ANOVA (SIMPLE ANALYSIS OF VARIANCE / ONE-WAY ANALYSIS OF VARIANCE)

One-way analysis of variance examines whether the distribution of a variable is the same across more than two groups.

Condition: more than two groups take one test (one variable).


F(2, 27) = 8.80, p < 0.05

1. df (between) = k - 1; df (within) = N - k; df (total) = N - 1

2. p value

3. two-tailed (non-directional) test only

To determine where the difference lies, run a post-hoc test with Bonferroni correction (see the sketch below).
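A minimal Python/scipy sketch of the one-way ANOVA plus a Bonferroni post-hoc check (three invented groups of 10, so df = (2, 27) as in the example above):

```python
# Minimal sketch: one-way ANOVA, then pairwise t tests with the
# Bonferroni correction to locate the difference. Data are invented.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=5, size=10) for m in (50, 52, 58)]

f, p = stats.f_oneway(*groups)
k, n_total = len(groups), sum(len(g) for g in groups)
print(f"F({k - 1},{n_total - k}) = {f:.2f}, p = {p:.3f}")

# Post hoc: multiply each pairwise p by the number of comparisons
n_pairs = k * (k - 1) // 2
for i, j in combinations(range(k), 2):
    t, p_pair = stats.ttest_ind(groups[i], groups[j])
    print(f"group {i} vs {j}: adjusted p = {min(p_pair * n_pairs, 1.0):.3f}")
```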

III. FACTORIAL ANALYSIS OF VARIANCE

Factorial analysis of variance examines whether the distribution of a variable (or variables) is the same across groups categorized by two independent variables.

Main effect and interaction effect

SPSS calls this univariate analysis of variance, because it concerns only one explained variable, i.e. one dependent variable.

*If more than two variables: Holy Grail Analysis of Variance
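For illustration, a hedged sketch of a factorial (two-way) ANOVA with main effects and the interaction effect. SPSS's Univariate procedure does this; one Python option is statsmodels (the data frame, factor names, and scores below are all invented):

```python
# Minimal sketch: two-way (factorial) ANOVA via an OLS model.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "gender": np.repeat(["m", "f"], 20),                # invented factor 1
    "treatment": np.tile(np.repeat(["a", "b"], 10), 2), # invented factor 2
    "score": rng.normal(50, 8, 40),                     # invented DV
})

# 'C(...)' marks categorical factors; '*' expands to main effects + interaction
model = ols("score ~ C(gender) * C(treatment)", data=df).fit()
print(anova_lm(model, typ=2))  # F and p for each main effect and the interaction
```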

IV. PAIRED-SAMPLES T TEST

The paired-samples t test examines whether the distribution of a variable in one group is the same before and after an experiment. (Unstated assumption: the experiment is the independent variable.)

Condition: only one group takes two tests.


t(24) = 2.45, p < 0.05

1. df = n - 1

2. p value

3. one-tailed or two-tailed test.
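Again a minimal Python/scipy sketch (one invented group of 25 measured twice, so df = n - 1 = 24 as in the example above):

```python
# Minimal sketch: paired-samples t test, before vs after an experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(50, 10, 25)          # invented pre-test scores
after = before + rng.normal(2, 5, 25)    # invented post-test scores

t, p = stats.ttest_rel(before, after)
print(f"t({len(before) - 1}) = {t:.2f}, p = {p:.3f}")
```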

 

Thursday, February 24, 2011

Statistics 2: Inferential statistics concepts

INFERENTIAL STATISTICS

Inferential statistics is used to infer information about a population from the observation of a sample.

Population-Sampling frame-Sampling pool (probability/random sampling or non-probability sampling)-Sample

Estimation

Estimation of numerical data

The principle is that if we could draw all possible samples from a population, the distribution of sample means would follow a normal curve, with the mean of the sample means, which equals the population mean, situated at the center. However, we only have the observation of one sample, and we don't know exactly where that sample is located in the distribution of samples; in other words, we don't know whether we have a typical, representative sample or not. We therefore estimate the standard deviation of the population from the sample by dividing the sum of squared deviations by (n - 1); that is also why the standard deviation is usually calculated with (n - 1) rather than n in the denominator. We then obtain the standard deviation of the sampling distribution of means (the standard error) by dividing the estimated population sd by the square root of the sample size. Using the properties of the normal distribution, we can then estimate a range for the population mean from the sample mean at a certain confidence level, e.g. (sample mean - 2 se, sample mean + 2 se) at the 95% confidence level.

Student's t distribution provides the adjustment of the confidence interval when the sample size varies. Basically, once the sample size is above 30, it closely follows the normal distribution.
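A minimal Python sketch of this estimation logic (scipy assumed; the sample is invented):

```python
# Minimal sketch: t-based 95% confidence interval for a population mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(100, 15, 25)        # one observed (invented) sample

mean = sample.mean()
sd = sample.std(ddof=1)                 # ddof=1: divide by (n - 1)
se = sd / np.sqrt(len(sample))          # sd of the sampling distribution of means

t_crit = stats.t.ppf(0.975, df=len(sample) - 1)  # Student's t adjustment
print(f"95% CI: ({mean - t_crit * se:.1f}, {mean + t_crit * se:.1f})")
```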

Estimation of categorical data

The principle is the same, only here we deal with category percentages. The standard deviation of the sampling distribution of percentages is:

SE(p) = sqrt(p(1 - p) / n), where p is the sample percentage and n the sample size.

Another difference is that the confidence level of the estimation is influenced not only by the sample size but also by the category percentage in the population. As there is no Student's t table for categorical data, we have another benchmark: the smaller category percentage of the sample * sample size should be > 1000; if not, increase the sample size.
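A minimal sketch of the same estimation for categorical data (the sample size and percentage are invented; the last line is the rule-of-thumb check from above):

```python
# Minimal sketch: 95% confidence interval around a sample percentage,
# using SE(p) = sqrt(p(1 - p) / n). Numbers are invented.
import math

n = 4000          # invented sample size
p = 0.30          # invented smaller category percentage of the sample

se = math.sqrt(p * (1 - p) / n)
print(f"95% CI: ({p - 1.96 * se:.3f}, {p + 1.96 * se:.3f})")

# Rule-of-thumb benchmark from the notes above
print("sample large enough:", p * n > 1000)
```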

Hypothesis testing

A good hypothesis should:

1) reflect the theoretical background and the available references

2) be a short and clear affirmative sentence

3) state a relationship between variables

4) be testable

Basic concepts of inferential statistics:
1. Null hypothesis vs. research hypothesis (nondirectional: two-tailed test; directional: one-tailed test)

The null hypothesis states that in the whole population there is no difference between the two groups; in other words, the independent variable does not cause a different distribution of the dependent variable, and the observed difference is due to coincidence.

2. Normal Curve

Mean = Median; 34.13% of the data lies between the mean and +1 standard deviation (likewise for -1); 13.59% between +1 and +2 sd; 2.15% between +2 and +3 sd; 0.13% beyond +3 sd. (Cumulative probability: below +1, 84%; between +1 and +2, 14%; between +2 and +3, 2.15%.)

3. Standard score (Z)

Measures the distance between a data point x and the mean, in units of standard deviation, from which we can infer the probability of observing a value like x.

4. Significance rate: 5%
The Z score is used to determine whether an event is caused purely by chance or is just a result of the casual distribution of probability. Thus, if an event is very unlikely to happen under normal conditions, meaning its probability is below 5% (|Z| > 1.65), then the null hypothesis is rejected (see the sketch after this list).

5. Significance level (p)

6. Degree of freedom (df)
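A minimal Python sketch of concepts 2-4 above (scipy assumed; the mean, sd, and x are invented):

```python
# Minimal sketch: standardize a value as a Z score, then read the
# probability of something at least that extreme off the normal curve.
from scipy import stats

mean, sd = 100, 15          # invented population parameters
x = 130                     # invented observation

z = (x - mean) / sd         # distance from the mean in sd units
p_above = 1 - stats.norm.cdf(z)

print(f"z = {z:.2f}, P(Z > z) = {p_above:.4f}")
print("reject H0 at 5% (one-tailed)?", z > 1.65)
```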


Type I error (α): when the null hypothesis is true, you decide it is wrong. Type II error (β): vice versa.
The null hypothesis is about the whole population and cannot be verified directly.

Type II error is related to the sample size

I think the most valuable part of today's chapter is about the design of inferential statistics. Statistical significance by itself is meaningless; the most important task is the analytic work of generating the hypothesis about the variables. In practical terms, the construction of the sample is also more important than the verification itself.

How to control the variables and how to choose the appropriate instruments is what I'll continue to learn, but I don't think it will be more difficult or more complicated than the original academic analysis, especially with the help of computer software.

P.108

Monday, February 21, 2011

Statistics 1: Descriptive Statistics

DESCRIPTIVE STATISTICS

Statistics is the main quantitative tool used by all social scientists; for this reason I am making this series of posts to record the outcomes of my self-study.

Revision of general concepts:

Type of data: Categorical variable (nominal, ordinal); Numerical variable (interval). ARRAY

Mode and Variation ratio for nominal variables; Median and Midspread for ordinal variables; Mean and SD or Median and midspread for interval variables.

Percentage, proportion, ratio and rates of occurrence

Importance of the Z score: for all types of data, the main idea should be to keep as much of the information contained in the original data as possible. For example, when we compare the advances China achieved in GDP and in human development, we should avoid reducing the numerical data to ordinal data, and instead use standardized Z scores to compare the historical development in these two areas.

Index construction: Index-Dimensions-Indicators-Variables-Values. To combine values calculated in different units, we add up their respective rank standings, either as a simple sum or as a weighted sum. We then obtain index rankings through the index scores.

Central tendency: Mean (arithmetic mean), Weighted mean, Median, Mode
Variability: Standard deviation, Variance, Range, Midspread (interquartile range: upper quartile - lower quartile); Variation ratio

Skewness and Kurtosis: positive skewness when the mean is larger than the median, negative when the mean is smaller than the median. SK = 3(mean - median)/sd
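A minimal Python sketch of these measures on an invented interval variable:

```python
# Minimal sketch: central tendency, variability, and Pearson's skewness
# formula SK = 3(mean - median)/sd, on invented data.
from collections import Counter
import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 12, 13, 13, 21])  # invented values

print("mean:", data.mean())
print("median:", np.median(data))
print("mode:", Counter(data.tolist()).most_common(1)[0][0])
print("sd:", data.std(ddof=1))
print("range:", data.max() - data.min())
q3, q1 = np.percentile(data, [75, 25])
print("midspread (IQR):", q3 - q1)
print("SK:", 3 * (data.mean() - np.median(data)) / data.std(ddof=1))
```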


CORRELATION

Direction (positive or negative); Nature (form of the line); Strength (how well the dependent variable can be predicted from knowing the independent variable)

Correlation coefficient = (Original error - Remaining error) / Original error

Correlation between categorical variables.

Cross tabulation with column percentages, with the independent variable in the columns; mosaic plot graph.

Lambda correlation coefficient: ((Total - Total mode) - ((Category 1 total - Category 1 mode) + (Category 2 total - Category 2 mode))) / (Total - Total mode).

When the category modes fall in the same row, the lambda correlation coefficient is always zero.
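A minimal Python sketch of the lambda formula on an invented 2x2 cross tabulation (rows = DV categories, columns = IV categories):

```python
# Minimal sketch: Goodman-Kruskal lambda from the formula above.
import numpy as np

table = np.array([[40, 10],   # invented counts; rows = DV, columns = IV
                  [20, 30]])

total = table.sum()
e1 = total - table.sum(axis=1).max()                 # errors ignoring the IV
e2 = (table.sum(axis=0) - table.max(axis=0)).sum()   # errors within each IV category
print("lambda =", (e1 - e2) / e1)                    # 0 when column modes share a row
```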

Correlation between numerical variables.

Scatter graph with best-fit line (regression line)

Y = aX + b, which must pass through the point (mean of IV, mean of DV), i.e. X = mean of the IV and Y = mean of the DV.

Coefficient of determination = (Variance of DV - Variance of errors) / (Variance of DV). The error is the difference between the observed value and the predicted value.
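A minimal Python/scipy sketch of the best-fit line and the coefficient of determination (the x and y values are invented):

```python
# Minimal sketch: regression line Y = aX + b and R^2 computed from the
# variance-of-errors formula above; it matches r squared.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # invented IV
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.2])    # invented DV

a, b, r, p, se = stats.linregress(x, y)
print(f"Y = {a:.2f}X + {b:.2f}")                 # passes through (mean X, mean Y)

errors = y - (a * x + b)                         # observed minus predicted
r2 = (y.var() - errors.var()) / y.var()          # the formula above
print(f"R^2 = {r2:.3f}  (equals r^2 = {r**2:.3f})")
```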

Correlation between categorical and numerical variables.

Eta squared coefficient = (Variance of DV - (Variance of category 1 + Variance of category 2) / 2) / (Variance of DV)
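A minimal sketch of this eta squared formula (two invented categories of equal size; population variances, i.e. divided by n, which is what makes the simple average of category variances work):

```python
# Minimal sketch: eta squared for a numerical DV split by a
# two-category variable, following the formula above.
import numpy as np

cat1 = np.array([4.0, 5, 6, 7, 8])     # invented DV values, category 1
cat2 = np.array([8.0, 9, 10, 11, 12])  # invented DV values, category 2

dv = np.concatenate([cat1, cat2])
eta2 = (dv.var() - (cat1.var() + cat2.var()) / 2) / dv.var()
print(f"eta squared = {eta2:.2f}")
```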

The coefficient of determination is the square of the correlation coefficient (Pearson's r); it gives the percentage of the variance of X that can be explained by the variance of Y. Note that when X and Y are correlated, the variances of X and Y may both be partly explained by a common factor Z; there is no implied causality between X and Y.

Choose the proper coefficient according to the type of data (a short sketch follows this list):
Pearson correlation coefficient for numerical data
Chi-square for categorical data
Spearman rank correlation coefficient for ordinal variables
Point-biserial correlation coefficient for a categorical variable and a numerical variable
Rank-biserial correlation coefficient for a categorical variable and an ordinal variable
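A minimal Python/scipy sketch of the first four choices (all inputs invented; scipy has no direct rank-biserial function, so that case is omitted):

```python
# Minimal sketch: one coefficient per data-type combination.
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.0, 4, 5, 4, 6])
print("Pearson r:", stats.pearsonr(x, y)[0])          # numerical vs numerical
print("Spearman rho:", stats.spearmanr(x, y)[0])      # ordinal vs ordinal

table = np.array([[10, 20], [20, 10]])                # categorical vs categorical
chi2, p, dof, expected = stats.chi2_contingency(table)
print("Chi-square:", chi2)

group = np.array([0, 0, 0, 1, 1, 1])                  # binary category
score = np.array([3.0, 4, 5, 6, 7, 8])                # numerical variable
print("Point-biserial:", stats.pointbiserialr(group, score)[0])
```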

In my paper, cross tabulation with column percentages could be used to describe the difference in the distribution of projects between public and private enterprises, while a scatter graph could be used to show the impacts of GDP, natural resources, and institutions on the total number of projects.


Research methods for the Chinese economy

As an economist trained in Europe, I have always wondered about the appropriate research methods for the Chinese economy. I'm not totally against the econometric modeling of conventional economic studies, but recently I have come to question the logical basis of this kind of research. Where do all those hypotheses guiding the mathematical efforts come from? Yes, all our current work is based on precedent research. But what if the precedent works are a series of deductive reasonings from a sort of "axiom", like interest maximization, which is itself questionable? If the theoretical base is in doubt, where will the empirical studies go? Especially when there exist so many statistical traps, and such unreliability of data sources, in the context of a transitional economy like China. In the end, economics is a branch of the social sciences, examining human behavior and interaction. Should we proceed as if it were a natural science? The simplicity and beauty of mathematical models is attractive, but it engenders an illusory certainty for consumers living in an uncertain world. One can argue that the simplification embodied in strict assumptions is methodologically necessary, but what is left unsaid are the hidden pre-assumptions considered non-debatable. Even if such a pre-assumption, like self-interest maximization, can explain 80% of reality, as mainstream economists claim, the unanswered 20% may conceal facts that are more important. Thus, I have to agree with my supervisor about the proper order of reasoning in social science, including economics: you first observe the social facts, describe them as authentically as possible, then analyze them to generate hypotheses, which in turn need to be verified by quantitative methods. As you can see, a qualitative study may precede a quantitative study, for the latter is just one means to check the former, not by itself the goal of scientific research. Where I don't agree with my supervisor is on when the reading of the pertinent literature should intervene. The searching and reading of precedent works (specific or theoretical) may, in my eyes, happen at any moment, either at the beginning of a project or at the end, provided it does not manipulate and mislead what you really observe in the field. Thus, I should perhaps describe myself as a micro-economist, who favors inductive reasoning from the micro level, and who considers the macroeconomy as an agglomeration of micro fibers.



Monday, February 7, 2011

Welcome

Welcome, everyone, to my first blog on the Chinese economy. It's basically the place where I'll store and comment on the latest news and research articles, but it's also a platform to exchange ideas with everyone interested in the emerging China.