# Statistics: Rules and Pitfalls
## Biodata Club
<div style='float:left'>
<hr color='#EB811B' size=1px width=720px>
</div>
 
### Jessica Minnier, PhD, Assistant Professor of Biostatistics, OHSU-PSU SPH, @datapointier
### March 2, 2018. Slides available at <http://bit.ly/biodata-stats>

---

# Topics:

## 1. Big Picture: ["Ten Simple Rules for Effective Statistical Practice"](https://doi.org/10.1371/journal.pcbi.1004961)
Kass et al, 2016, PLOS Computational Biology

## 2. Specific Details: ["Common Statistical Pitfalls in Basic Science Research"](http://europepmc.org/articles/PMC5121512)
Sullivan et al, 2016, Journal of the American Heart Association

---

#  What is statistics?

### Statistics is not a hurdle to overcome

### It is not a recipe

### It is a framework for doing science

### It is a language ("with probability its grammar")

### It informs:

### design + data generation + model + analysis + inference + design next study

---

# Rule 1, Methods work *with* the data

### Statistical methods enable data to answer scientific questions

- We want to answer a scientific question, not *just* analyze the data because it looks a certain way (method should not be based on just data structure)

- Not "Which test should I use?", but "How can the data provide answers?" (statistical test, different type of test, visualization, clustering), e.g. "Where are the differential genes?"

- Statistical experts are collaborators who
 + identify potential sources of variability
 + look for "hidden realities" that might mean the data can't answer your question
 + then develop analytic goals and strategies
 
- Statistical planning and collaboration need to start EARLY (see Rule 3)

### Practice Team Science!

---

# Example:

### 2 groups, continuous variable; easy, right?

- T-test!(?)

- Not if there's correlation in the data, or confounder(s), or mismeasured outcomes

- What if the data are matched and you want to know if the variables move in the same direction/correlate? 
 + T-test is the wrong answer (just estimate correlation!)
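
A minimal sketch of the matched-data point above, using made-up measurements (the `before`/`after` values and the helper names are illustrative assumptions, not data from the talk). The paired t statistic asks whether the mean difference is zero; the correlation asks whether the two measurements move together. Shifting one arm by a constant changes the first and leaves the second untouched.

```python
import math
import statistics

# Hypothetical matched measurements: the same six samples under two conditions.
before = [4.1, 5.0, 5.9, 7.2, 8.0, 9.1]
after = [5.2, 5.9, 7.1, 8.0, 9.2, 9.9]


def pearson_r(x, y):
    """Pearson correlation: do x and y move together?"""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t_statistic(x, y):
    """Paired t statistic: is the mean of the within-pair differences zero?"""
    diffs = [b - a for a, b in zip(x, y)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


r = pearson_r(before, after)
t = paired_t_statistic(before, after)

# Same co-movement, very different mean difference.
shifted = [b + 10 for b in after]

print(f"original: r = {r:.3f}, paired t = {t:.1f}")
print(f"shifted:  r = {pearson_r(before, shifted):.3f}, "
      f"paired t = {paired_t_statistic(before, shifted):.1f}")
```

The two statistics answer different scientific questions, so the choice between them should follow from the question and not from the shape of the data.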

---

# Variability

### Rule 2: Signals always come with noise

- Statisticians are trained to look for and deal with variability

- Models simplify patterns and describe variability so that the signal can be detected

- Big data make these issues worse!

### Rule 7: Provide Assessments of Variability
 - Almost all measurements have uncertainty (estimate it!)
 
 - Careful! You need to account for dependencies in the data, or else the standard errors (SEs) are wrong
 
 - Batch effects increase variability
 
 - Big data is not a substitute for large `\(n\)` (careful when estimating SE and n)
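
A small sketch of why dependence matters for SEs, assuming a hypothetical experiment with 4 animals and 5 technical replicates each (all numbers are invented for illustration). Treating the 20 readings as independent gives a much smaller SE than treating the 4 animals as the independent units.

```python
import math
import statistics

# Hypothetical data: replicates within an animal are far more alike
# than readings from different animals.
clusters = {
    "animal_1": [10.1, 10.3, 9.9, 10.2, 10.0],
    "animal_2": [12.0, 12.2, 11.9, 12.1, 11.8],
    "animal_3": [8.1, 7.9, 8.0, 8.2, 7.8],
    "animal_4": [11.0, 11.1, 10.9, 11.2, 10.8],
}


def standard_error(values):
    """Standard error of the mean, assuming the values are independent."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Naive analysis: pretends n = 20 independent observations.
all_values = [v for reps in clusters.values() for v in reps]
naive_se = standard_error(all_values)

# Cluster-aware analysis: one summary per animal, so n = 4.
cluster_means = [statistics.fmean(reps) for reps in clusters.values()]
cluster_se = standard_error(cluster_means)

print(f"naive SE (n = {len(all_values)}): {naive_se:.3f}")
print(f"cluster-level SE (n = {len(cluster_means)}): {cluster_se:.3f}")
```

Averaging within animals is only one simple way to respect the structure; a mixed model is a common alternative when the design is unbalanced.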

---

# Example: Sample Size

### What's my `\(n\)`?

- Sample size/power calculations depend on variability
 + biological variability + technical variability + sampling variability
 + but this is often unknown, so consider a range of values

- Dependence structure in the data `\(\rightarrow\)` this will change the `\(n\)`
 + e.g. repeated measures, clustering, family/relatedness, matching
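
A rough sketch of both points, under stated assumptions: the textbook normal-approximation formula for comparing two means, `n = 2 * ((z_alpha + z_beta) * sd / delta)^2` per group, and the usual design effect `1 + (m - 1) * ICC` for clusters of size `m`. The effect size, SDs, cluster size, and ICC below are placeholders, not values from the talk.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def design_effect(cluster_size, icc):
    """Inflation factor for observations that arrive in correlated clusters."""
    return 1 + (cluster_size - 1) * icc


# The SD is rarely known in advance, so scan a plausible range.
for sd in (0.8, 1.0, 1.2, 1.5):
    n_independent = n_per_group(delta=1.0, sd=sd)
    n_clustered = math.ceil(n_independent * design_effect(cluster_size=5, icc=0.3))
    print(f"sd = {sd}: n per group = {n_independent} if independent, "
          f"about {n_clustered} observations if clustered")
```

With `delta = 1` and `sd = 1` this gives 16 per group; an exact t-based calculation would give a slightly larger number, so treat the output as a starting point for a conversation with a statistician.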

---

# Don't Waste Your Money

### Rule 3: Plan Ahead, Really Ahead
 - Not just "what should my n be?"
 
 - Statisticians are trained to look at the big picture and look for future problems in data collection, processing, analysis, inference (can we "design out" the problems?)
 
### Rule 4: Worry about Data Quality
 - Pre-processing, data cleaning, how did this data get here?
 
 - Careful! Why is this data missing?
 
---

# Ex: Problems you may not anticipate

### Outcome troubles

- Your outcome measure is too variable, or it measures what you want in a biased way
- Too many outcome measures to detect meaningful effects after multiple testing correction

### Structure in your data

- Batch effects
- Sample size assuming independence could be incorrect

### Design

- Is your goal hypothesis testing, or is it actually estimation?
- How will you randomize, stratify, and otherwise remove bias and confounding?

---

# Modeling
 
### Rule 5: Statistics `\(>\)` Set of Computations
 - Always EXPLAIN how the methods help you answer the biological question (why did you pick that method?)
 
 - Code knowledge `\(\neq\)` statistics knowledge
 
### Rule 6: Keep it Simple (KISS)
 - Simple models often provide the strongest results and are the easiest to interpret
 
 - Careful! curse of overfitting (not generalizable)
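
A toy illustration of overfitting, assuming a hypothetical truth `y = 2x + noise` (the simulation settings and function names are invented for this sketch). A straight line is compared with a "memorizer" (1-nearest-neighbour) that reproduces the training data exactly, noise included.

```python
import random
import statistics

random.seed(2018)  # fixed seed so the sketch is repeatable


def simulate(n):
    """Hypothetical truth: y = 2x + Gaussian noise with SD 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


train_x, train_y = simulate(30)
test_x, test_y = simulate(200)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in (("straight line", line), ("memorizer", memorizer)):
    print(f"{name:>13}: train MSE = {mse(model, train_x, train_y):.2f}, "
          f"test MSE = {mse(model, test_x, test_y):.2f}")
```

The memorizer has zero training error by construction, so training error alone says nothing about how well it generalizes; the error on new data is the number to watch.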

---

# Rule 8: Check Assumptions

### The Big Ones:

- Normality
- Independence
- Missing Data (bias related to missingness not at random)

### Careful!

- Some models/tests are more robust to deviations than others
- Skewness can give you very weird results
- Systematic bias (e.g. in sampling)
- Visual checks (e.g. plot residuals from a regression)
- You may need: nonparametric analysis, survival analysis, longitudinal analysis

---

# Ex: Parametric vs Non-parametric Tests?

### Often misunderstood

- T-tests assume normality of the population, but in *large* samples the central limit theorem usually "kicks in", since the distribution of the sample mean is then approximately normal (but again, under assumptions! independence!)
 + Ratios or "fold changes" are not normally distributed
 
- Wilcoxon test or other common nonparametric tests do not assume normality, but you are no longer testing means (nor medians, unless specific assumptions are met)
 + outcome: ordinal or interval
 + can have lower power in small samples, but more conservative
 
- Common nonparametric tests are often performed by replacing the data with ranks
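
To make "replacing data with ranks" concrete, here is a minimal sketch of the Mann-Whitney U statistic (the statistic behind the Wilcoxon rank-sum test) computed from pooled ranks. The `control`/`treated` values are invented, and a p-value would still require the null distribution of U, which is omitted here.

```python
def midranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def rank_sum_u(group_a, group_b):
    """Mann-Whitney U for group_a, computed from ranks in the pooled sample."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    return sum(ranks[:n_a]) - n_a * (n_a + 1) / 2


# Hypothetical data with one extreme value in the treated group.
control = [1.2, 1.5, 1.1, 1.9, 1.4]
treated = [2.3, 1.8, 2.9, 40.0, 2.1]
treated_less_extreme = [2.3, 1.8, 2.9, 4.0, 2.1]

print(rank_sum_u(control, treated))
print(rank_sum_u(control, treated_less_extreme))
```

Both calls give the same U because only the ordering of the observations enters the statistic; this is also why the test no longer speaks directly about means.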

---

# Practice Sound Science

### Rule 9: When Possible, Replicate!

- P-hacking, overtesting, overselecting, overfitting --> results cannot stand the test of time (replication)
- Replicate using new data
- Know your limitations (small sample, biased sampling, unmeasured confounding) and remove those issues next time. Learn from your study!

### Rule 10: Make Your Analysis Reproducible
 
- Minimum standard
- Given same data, description of analysis --> reproduce all results/tables/figures
- Use reproducible code tools (e.g. R Markdown), version control (e.g. git)

---

# Example: Multiple testing

### Scary but real

- Each test has a nonzero (typically 5%) probability of incorrectly claiming significance when there is no true effect

- Each additional test increases the chance of at least one false positive

- Know what type I error and False Discovery Rate (FDR) and Family-Wise Error Rate (FWER) mean and what the differences are

- If you want to relax the error rates, you are doing an *exploratory* study. No denial. Now, replicate!
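
A short sketch of the quantities above, under simplifying assumptions: the FWER formula `1 - (1 - alpha)^m` holds for `m` independent tests with all null hypotheses true, Bonferroni controls the FWER, and the Benjamini-Hochberg step-up procedure controls the FDR. The p-values are made up for illustration.

```python
def fwer_independent(alpha, m):
    """P(at least one false positive) across m independent true-null tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Controls the FWER: compare each p-value with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Controls the FDR: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from 10 tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print(f"FWER for 20 tests at alpha = 0.05: {fwer_independent(0.05, 20):.2f}")
print("unadjusted rejections:", sum(p <= 0.05 for p in p_values))
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_values)))
```

With 20 independent tests the chance of at least one false positive is already about 64%. On these p-values the unadjusted analysis rejects 5 hypotheses, Benjamini-Hochberg 2, and Bonferroni 1, which shows how the choice of error rate changes the conclusions.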

---

# Misc. Errors to Avoid

- Acknowledge the ways the data are SELECTED prior to formal analyses
- Do not use the same data set to both generate and test hypotheses
- Relatedly, practice safe EDA (exploratory data analysis)
 + Use EDA to visualize distributions and look for outliers or possible model misspecifications, but [don't choose which hypotheses to test after EDA](https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data) (unless your study is truly exploratory)
- Use dot plots when you have small samples (not bar plots, or even just boxplots)

![](minnier_biodata_files/figure-html/unnamed-chunk-2-1.png)

---

# Similar Articles/References

- Harrell, Frank. [Manuscript Checklist](http://biostat.mc.vanderbilt.edu/wiki/Main/ManuscriptChecklist) on the Vanderbilt Biostatistics website.
- Wicherts, Jelte M., et al. ["Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking."](https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01832/full) Frontiers in Psychology 7 (2016): 1832. 
- Marino, Miguel. ["Reflections from a statistical editor: elements of great manuscripts."](http://www.annfammed.org/content/15/6/504.short) Annals of Family Medicine 15.6 (2017): 504-506.
- Zinsmeister, Alan R., and Jason T. Connor. ["Ten common statistical errors and how to avoid them."](https://www.nature.com/articles/ajg20085055) The American journal of gastroenterology 103.2 (2008): 262.
- Altman, Douglas G., and J. Martin Bland. ["Improving doctors' understanding of statistics."](http://www.jstor.org/stable/2983040) Journal of the Royal Statistical Society. Series A (Statistics in Society) (1991): 223-267.

---

# Checklist to avoid p-hacking, Wicherts et al

---

# Resources on Campus

Biostatistics and Design Program (BDP)
- Email bdp@ohsu.edu
- Campus-wide (University Shared Resource)
- Drop in hours on Tuesday
- Can use Dean’s award or USR awards to pay for core services for grants
- https://www.ohsu.edu/xd/research/research-cores/bdp.cfm

Knight Cancer Institute Biostatistics Shared Resource (BSR)
- For Knight Cancer Institute members/investigators
- Pre-award help is subsidized
- Submit project request: https://bridge.ohsu.edu/research/knight/resources/BSR/SitePages/Project%20Requests.aspx

West Campus: BBU Biostatistics and Bioinformatics Unit
- Email Suzi Fei
- http://www.ohsu.edu/xd/research/centers-institutes/onprc/research-services/research-support/Biostatistics-a-Bioinformatics.cfm

---

# Courses at OHSU

School of Public Health Courses
- BSTA 523 Design of Experiments
    + Statistical principles of research design and analysis
- PHPM 524 Introduction to Biostatistics
- PHPM 525/BSTA 511 Biostatistics I: Estimation and Hypothesis Testing in Applied Biostatistics

Human Investigations Program (HIP)
- HIP 528 and 529: Applied Biostatistics I and II

CSEE Computer Sci/Electrical Engineering
- Data Science classes
- MATH 530/630 Probability & Statistical Inference for Scientists and Engineers

---

# Learn Statistics

- [Modern Dive](http://moderndive.com/) An Introduction to Statistical and Data Sciences via R - Chester Ismay & Albert Kim
- [Biostatistics for Biomedical Research](http://www.fharrell.com/doc/bbr.pdf) - Frank E Harrell Jr & James C Slaughter, from [ClinStat Class at Vanderbilt](http://biostat.mc.vanderbilt.edu/wiki/Main/ClinStat)
- Coursera, edX, Data Camp

---

# Thank you!

Contact: minnier-[at]-ohsu.edu, [datapointier](https://twitter.com/datapointier), [jminnier](https://github.com/jminnier/)

Slides available at <http://bit.ly/biodata-stats>

Code for slides available at <https://github.com/jminnier/talks_etc>

Slides created via the R package [xaringan](https://github.com/yihui/xaringan) by [Yihui Xie](https://twitter.com/xieyihui?lang=en) with the metropolis theme