DESCRIPTIVE STATISTICS
Statistics is the main quantitative tool used by social scientists, so I am writing this series of posts to record what I learn through self-study.
Revision of general concepts:
Types of data: Categorical variables (nominal, ordinal); Numerical variables (interval).
Mode and Variation ratio for nominal variables; Median and Midspread for ordinal variables; Mean and SD or Median and midspread for interval variables.
Percentage, proportion, ratio and rates of occurrence
Importance of the Z score: for every type of data, the main idea is to keep as much of the information in the original data as possible. For example, when we compare China's progress in GDP with its progress in human development, we should not reduce the numerical data to ordinal data. Instead we use standardized Z scores, Z = (value - mean)/SD, to compare the historical development in these two areas.
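A minimal sketch of the Z score idea in Python. The figures below are made up for illustration only, and using the population SD (`pstdev`) is my assumption, since the notes do not say which SD is meant.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the population SD."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]

# Made-up yearly figures for illustration only (not real data).
gdp = [100, 120, 150, 190, 240]        # e.g. billions of some currency
hdi = [0.50, 0.55, 0.60, 0.66, 0.72]   # e.g. a 0-1 development score

gdp_z = z_scores(gdp)
hdi_z = z_scores(hdi)

# Both series are now in SD units, so they can be compared year by year.
for year, (g, h) in enumerate(zip(gdp_z, hdi_z), start=1):
    print(f"year {year}: GDP z = {g:+.2f}, HDI z = {h:+.2f}")
```

After standardizing, each series has mean 0 and SD 1, so the two can be compared even though the original units differ.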
Index construction: Index-Dimensions-Indicators-Variables-Values. To combine values measured in different units, we convert each to a rank standing and add the ranks up, either as a simple sum or as a weighted sum. The resulting index scores then give the index ranking.
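The rank-sum idea could be sketched roughly as follows. The function names, the indicator names and the numbers are all hypothetical, and I assume that a higher value is better and that rank 1 is the best standing.

```python
def rank_standings(values, higher_is_better=True):
    """Rank 1 = best standing. Ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def index_scores(indicators, weights=None):
    """Sum the (optionally weighted) rank standings across indicators, per case."""
    weights = weights or {name: 1 for name in indicators}
    n_cases = len(next(iter(indicators.values())))
    scores = [0] * n_cases
    for name, values in indicators.items():
        for case, rank in enumerate(rank_standings(values)):
            scores[case] += weights[name] * rank
    return scores

# Three hypothetical cases measured on two indicators in different units.
indicators = {"income": [30, 10, 20], "literacy": [0.9, 0.7, 0.8]}
simple = index_scores(indicators)                                 # simple sum
weighted = index_scores(indicators, {"income": 2, "literacy": 1}) # weighted sum
print(simple, weighted)
```

With rank 1 as the best standing, a lower index score means a better overall position.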
Central tendency: Mean (arithmetic mean), Weighted mean, Median, Mode
Variability: Standard deviation, Variance, Range, Midspread (interquartile range: upper quartile - lower quartile), Variation ratio
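A small sketch of the two less familiar variability measures. I take the variation ratio to be 1 minus the share of cases in the modal category. Quartiles can be computed by several conventions, and this sketch simply uses the default of Python's `statistics.quantiles`.

```python
from collections import Counter
from statistics import quantiles

def variation_ratio(values):
    """Share of cases that are NOT in the modal category (for nominal data)."""
    modal_count = max(Counter(values).values())
    return 1 - modal_count / len(values)

def midspread(values):
    """Interquartile range: upper quartile minus lower quartile."""
    lower_q, _median, upper_q = quantiles(values, n=4)
    return upper_q - lower_q

# Made-up examples.
print(variation_ratio(["a", "a", "b", "c", "a"]))  # 3 of 5 cases are modal
print(midspread([1, 2, 3, 4, 5, 6, 7]))
```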
Skewness and Kurtosis: skewness is positive when the mean is larger than the median (long right tail) and negative when the mean is smaller than the median (long left tail). SK=3(mean-median)/sd
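With SK = 3(mean - median)/sd, the sign of the skewness follows the sign of mean - median. Here is a quick check on made-up data, assuming the population SD.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """SK = 3 * (mean - median) / sd, using the population SD (an assumption)."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

right_tail = [1, 2, 2, 3, 12]       # one large value pulls the mean above the median
left_tail = [-12, -3, -2, -2, -1]   # mirror image: mean below the median

print(pearson_skewness(right_tail))  # positive
print(pearson_skewness(left_tail))   # negative
```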
CORRELATION
Direction (positive or negative); Nature (form of the relationship, e.g. a straight or curved line); Strength (how well the dependent variable can be predicted from the independent variable)
Correlation coefficient = (Original error - Remaining error)/Original error
Correlation between categorical variables.
Cross tabulation with column percentages, with the independent variable in the columns; Mosaic plot.
Lambda correlation coefficient = ((Total - Total mode) - ((Category 1 total - Category 1 mode) + (Category 2 total - Category 2 mode))) / (Total - Total mode).
When the modal cells of all categories fall in the same row, the lambda correlation coefficient is always zero.
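The lambda formula is a direct instance of (Original error - Remaining error)/Original error. Below is a minimal sketch on two made-up cross tabulations. The table layout (one dict per category of the independent variable) and the labels are my own choice for illustration.

```python
def lambda_coefficient(table):
    """table maps each IV category to a dict {DV category: count}."""
    # Row totals of the DV across all IV categories.
    dv_totals = {}
    for counts in table.values():
        for dv_category, count in counts.items():
            dv_totals[dv_category] = dv_totals.get(dv_category, 0) + count
    total = sum(dv_totals.values())

    # Errors when guessing the overall mode for every case.
    original_error = total - max(dv_totals.values())
    # Errors when guessing each IV category's own mode.
    remaining_error = sum(sum(c.values()) - max(c.values())
                          for c in table.values())
    # Note: undefined (division by zero) if all cases share one DV category.
    return (original_error - remaining_error) / original_error

# Made-up counts: the modes sit in different rows, so knowing the IV helps.
different_rows = {"public": {"A": 30, "B": 10}, "private": {"A": 10, "B": 30}}
# Made-up counts: both modes sit in row A, so lambda is zero.
same_row = {"public": {"A": 30, "B": 10}, "private": {"A": 25, "B": 15}}

print(lambda_coefficient(different_rows))
print(lambda_coefficient(same_row))
```

In the second table the percentages still differ between columns, yet lambda is zero. This is why lambda can understate an association that the cross tabulation itself shows.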
Correlation between numerical variables.
Scatter graph with best-fit line (regression line)
Y=aX+b, where X is the IV and Y is the DV. The line must pass through the point (mean of X, mean of Y), i.e. (mean of IV, mean of DV).
Coefficient of determination = (Variance of DV-Variance of errors)/(Variance of DV). Error is the difference between observed value and predicted value.
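A minimal least-squares sketch, taking X as the independent variable and Y as the dependent variable. The data are made up. The check at the end confirms that the fitted line passes through the point of means and that the variance-based formula gives the coefficient of determination.

```python
from statistics import mean, pvariance

def fit_line(x, y):
    """Ordinary least squares for Y = a*X + b; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    b = my - a * mx   # forces the line through (mean of X, mean of Y)
    return a, b

def coefficient_of_determination(x, y):
    """(Variance of DV - Variance of errors) / Variance of DV."""
    a, b = fit_line(x, y)
    errors = [yi - (a * xi + b) for xi, yi in zip(x, y)]  # observed - predicted
    return (pvariance(y) - pvariance(errors)) / pvariance(y)

# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
r2 = coefficient_of_determination(x, y)
print(a, b, r2)
print(a * mean(x) + b, mean(y))  # the two values agree
```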
Correlation between categorical and numerical variables.
Eta squared coefficient = (Variance of DV - ((Variance of category 1) + (Variance of category 2))/2) / (Variance of DV). The simple average of the category variances assumes categories of equal size; otherwise each category's variance is weighted by its number of cases.
The coefficient of determination is the square of the correlation coefficient (Pearson r). It shows the percentage of the variance of one variable that can be explained by the variance of the other. When X and Y are correlated, the variance of both may be partly explained by a common factor Z, so the correlation implies no causality between X and Y.
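A sketch of eta squared on made-up data. To stay correct for categories of unequal size, it weights each category's variance by its number of cases. For equal-sized categories this reduces to the simple average of the category variances.

```python
from statistics import pvariance

def eta_squared(groups):
    """groups maps each category to the list of DV values observed in it."""
    all_values = [v for values in groups.values() for v in values]
    total_variance = pvariance(all_values)
    # Size-weighted average of the within-category variances.
    within_variance = sum(len(values) * pvariance(values)
                          for values in groups.values()) / len(all_values)
    return (total_variance - within_variance) / total_variance

# Made-up example: the DV differs strongly between the two categories.
groups = {"public": [1, 2, 3], "private": [7, 8, 9]}
print(eta_squared(groups))
```

Here most of the variance in the DV lies between the two categories rather than within them, so eta squared is close to 1.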
Choose the proper coefficient according to type of data set.
Pearson correlation score for numerical data
Chi-square for categorical data
Spearman rank correlation coefficient for ordinal variables
Point-biserial correlation coefficient for a two-category (dichotomous) variable and a numerical variable
Rank-biserial correlation coefficient for a two-category (dichotomous) variable and an ordinal variable
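Two of the coefficients in this list can be built directly on Pearson's r, which the sketch below does by hand. Spearman's coefficient is Pearson's r computed on ranks, and the point-biserial coefficient is Pearson's r with the two categories coded 0 and 1. The data and the 0/1 coding are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

def point_biserial(binary, values):
    """Pearson's r with the dichotomous variable coded as 0/1."""
    return pearson_r(binary, values)

# Made-up data: a perfectly monotonic but non-linear relationship.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(pearson_r(x, y), spearman_rho(x, y))

# Made-up data: category coded 0/1 against a numerical score.
print(point_biserial([0, 0, 1, 1], [2.0, 3.0, 6.0, 7.0]))
```

The first example shows why the choice matters: the ranks agree perfectly, so Spearman's coefficient is 1, while Pearson's r is below 1 because the relationship is not a straight line.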
In my paper, cross tabulation with column percentages could be used to describe how the distribution of projects differs between public and private enterprises, while scatter graphs could be used to show the impact of GDP, natural resources and institutions on the total number of projects.