Statistic is really a thing to take serious consideration. Evidence that taking from skewed sample is Anecdotal evidence, which assume only very particular cases. It is when generalizing to the mode trend of the data, we can see more generalized summary.Perhaps Anecdotal Evidence can refer back to smoker cases, where people sometimes said that smoking doesn't affect lung cancer. There are perhaps some cases where this might actually true. But the general trend stay strong, smoking did affect lung cancer. Often this evidence is dangerous when one make causation based on observe associated varables. As politics stabilized, rock music gain popularity in a country, considered as association but not causation.

In Statistic, mostly we divide into following components:

  • Research question
  • Population
  • Sample
  • Generalize

In the case of research to analyze certain alcohol impact, the targeted population is everyone that drink alcoholic. Turns out, because the sample only taken from ER at Hospital in Baltimore, it can only generalize to Residents of Baltimore.

Thing to consider is design, what's best question to approach the problem,, and the scope of the variables, whether the relation is correlation or causation.

  1. Identifying each of the type of variable of data your working with is the first step in EDA.This will make you easier to choose which kind analysis you want to do.

  1. Relationship of the two variable.
  2. If connected called as

We can divide studies into observational and experiment. Observational is when you take a survey and take sample inputs. Controlled experiment, is happen when you're conduct an experiment that you decide who's take the experiment, randomize group of people you decide, and which of them that you assigned a variables. There's two clear different between observational studies and controlled experiment. If you can see strong correlation between two variables in observational studies, assume you're doing random sampling, then you can generalize your findings based on target population.If you're doing controlled experiment and you're doing random assignment, you finds that there's strong correlation, you can make causation and generalization.

The Observational studies, you can see the corellation of the data, but can't make causation of the data, since you don't control the lurking variables.You make past data, and see the relationship when you are generate data throughout your studies.

On the other hand, in the experiment, you can control all lurking variables, that is the variables that potentially missing from your attention. If we control all this, we can make causation between explanatory variable and the response variables.

In the controlled experiment,For example workout vs not workout, you randomly assigned group of people into two group, thus randomly divided people that maybe category divided(gender, body shape) into equal quantity, then you make an experiment. On the contrary, because you don't randomly assigned in Observational Studies, your object experiment will be skewed. And because of that, you're missing the lurking variables, the people's category, and can't make causation.Lurking variables, often called Confounding variables, is what makes it clear distinct that correlation can't make causation.

Why not use just entire population and sample it?

Census will require a lot of resources, with time and money. People could also missing the location. Some people maybe in very deep urban area, or people that hard to measure, like illegal immigrants that don't want to fill out the survey.Population could also hard to be static, as we know they change location overtime.

There are few categories that we can defined bias when we're doing sampling from the population. Let's take raise of cost of the public transport at your city for example.

  • Convenience Sample : This are the sample that you do, when you do only in your neighborhood, and not the entire city. So people in your neighborhood will likely many more than others in the rest of the city
  • Non-response : The people that you're asking with is never take the public transport, so they don't answer with your survey. This could means that these people(not random) won't give the answer making you have non-response bias
  • Voluntary response : These are the people that using public transport only recently, and they have strong opinions about the issue. It maybe a little too strong, compared to the rest of categories in your city.

There are usually three kinds of method, when you're doing sampling

  • SRS: These are the samples when you randomly pick sample out of whole population
  • Stratified : We divide the population into groups, then sampling it from the groups. For example, groups the population with men vs women, and sample from each groups. Note that each of the strata is different towards one another, but within strata is similar.
  • Cluster : We cluster the population, and see if there's clusters that similar, only pick one cluster from each of the similarity groups, and sample from those.The clear different from stratified is that you have non-homogenous data inside each of the cluster, but clusters perform similarity towards one another. These happen when you don't want to explore entire population that will be take a lot of resources.

  • When doing simple random sampling, you can take random assignment.
  • In Stratified sampling, even when you divide the neighborhood, you still can do sampling
  • In Cluster sampling when it cluster based on neighborhood, you can't find similar neighborhood and only sample of these. It said that neighborhoods are distinct and unique, and avoid the one neighborhood will not be representative of the whole population.

When designing an experiment, these are the things that you should do.

  1. You want to divide the sample into treatment vs placebo groups, and to control this two group, also divide based on characater, e.g. gender and age/
  2. Randomly assign each of the subject, based on the characteristic groups, to treatment or placebo groups.
  3. Collect enough sample to do the experiment, sufficiently larger. If not, repeating the experiment by replicate the entire study.
  4. Block lurking/confounding variables, that suspected affect the results.

Suppose that the gels maybe achieve diferently in pro and amateur. We block the status, divide them into pro and amater group, and divide it again to treatment and control.These way, there are 4 group that represented equally.

If the explanatory variables, is the independant variable, a quantify variable to measure the causation of our dependant variable.

On the other hand, blocking variables is the variable that we think will behave differently, a category variable, so we're gonna observe it too.

Response variable is a dependant variable, some score of accuracy that we're meassure.

So if you're going to know whether the energy gels helps to increase people to run faster, you're divide into two groups, a group that receieve treatment, the energy gels, and second the group that receive placebo, fake treatment, and by doing this we don't give them energy gels. We're doing blinding method, the people we're doing experiment with don't know which treatment they're gonna get. Sometimes an experiment is done in double-blinding, the people and the researcher don't know the treatment and the placebo, to avoid favorite bias. Placebo effect would shown to the researchers what are lurking variables that help the people making run faster even without receiving the treatment

Random Sampling is taking sample from whole population, therefore the purpose is to take generalizablity to the population. Meanwhile Random Assignment serve to randomly assign each group of characteristic,to the treatment and the control.In other words, Random Assignment only come after Random Sampling.

The ideal experiment is when we can generalize to the whole population. But often it doesn't work that way. As we discussed earlier, sampling entire population is hard. So most experiment maybe doing causation for subset of population.Observational studies that doesn't show any correlation and can't be generalize can't be used. On the other hand, if it show any correlation, then we can share it to the public.

So in summary:

  • Showing strong correlation, you're study can always have causation if you have controlled experiment, whether you have random sampling or not. If you're doing random sampling, then you're experiment can be generalized to target population. Same goes for observational studies that have random sampling, it can generalized well. But bad experiment can be get from resulting no causation and not generalized to target population.

  • So why random sampling can be generalized to target population? Because when doing random sampling, we get sample that have independent of that target characteristics. The sampling won't take gender,characteristic,occupation,age, or everything else into account.

  • Random assignment on the other hand perform causation. This must always be done in controlled experiment. With this, we also be assured that all the samples are independent. Suppose for example you divide groups into age, gender, or else. Then you take random sampling and distribute equal number of people to groups that takes treatment vs placebo. Doing random assignment can get you avoid counfounding variables. That way we can have our causation results.

  • Cluster sampling is more benefit from the other two sampling, simple random sampling and stratified sampling. With simple random sampling, you don't know which one group that take weight more than the others, for example perhaps when doing SRS you grab more male than female. SRS is blindly sampling without knowing specific conditions. Stratified perform better than SRS as it's first differentiate with conditions. Male and Female can be sampled with equal number. But perhaps cluster sampling is more efficient than SRS. We can have different specific type condition in one cluster, clusters them, and based on one similarity group of clusters, sample clusters from each of the group, and random sample on each of the cluster selected. This way we are more assured of it random.

  • Blinding is an effective method when doing a controlled experiment. For example in research study of pill treatment, people receive blinding so people in treatment group don't get suggested that they feel healthier if they know they receive treatment, or people in placebo group don't feel suggested that they feel sick because they know they didn't receive treatment. This way the study can avoid bias from the people in the results of the experiment, for example in the survey results. Double blinding is also used to the researchers, so they don't whether the pill is treatment or placebo. That way the researchers not doing any bias when analyze the results, e.g. favor treatment group more than the placebo because they wanted to show the effectiveness of the treatment.


In [ ]: