Recommended introductory R statistics course: Data Analysis for the Life Sciences

EndNote related news | 2018-09-06 09:14

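Data Analysis for the Life Sciences belongs to Harvard's PH525x series (PH525x series - Biomedical Data Science). The entire course uses R, both to teach the theory of statistical analysis and to practice it. The textbook is written in R Markdown, which keeps it easy to read while guaranteeing that the analyses are reproducible, in line with the most advanced reproducible-computing standards in science. Readers can systematically learn the statistics a biologist needs, implement it themselves in code, and meet the reproducible-code requirements for publishing in Cell, Nature, and Science (CNS).

Traditional statistics texts focus on mathematical principles, whereas this book focuses on carrying out data analysis on a computer. It explains the mathematical principles through worked examples and provides code so that readers can run the analyses themselves. The whole text is written in R Markdown, so readers can reproduce every analysis.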

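About the authors:

Rafael A Irizarry is a professor of biostatistics and computational biology at the Dana-Farber Cancer Institute and the Harvard School of Public Health, with 17 years of experience analyzing genomic data.

Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he has developed open-source statistical software for Bioconductor.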

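Course source code: includes all of the course source code, test data, and results.

Web tutorial: includes the rendered output of the course Rmd files as web pages, with per-section navigation and download links for the Rmd source.

E-book: available in several formats for reading on mobile devices.

Interestingly, you can study for free or pay the authors up to $80.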

R markdown source files

ePub version on Leanpub

Links to the HarvardX class pages

External resources and books

Finding more help for data analysis

Introduction [Rmd]

Getting started [Rmd]

Getting started exercises

Data manipulation: dplyr introduction [Rmd]

dplyr introduction exercises

Mathematical notation [Rmd]

Random variables [Rmd]

Random variables exercises

Populations and samples [Rmd]

Populations and samples exercises

CLT and t-distribution [Rmd]

CLT and t-distribution exercises

CLT in practice [Rmd]

CLT in practice exercises

t-test in practice [Rmd]

Confidence intervals [Rmd]

Power calculations [Rmd]

Power calculations exercises

Monte Carlo [Rmd]

Monte Carlo exercises

Permutation tests [Rmd]

Permutation tests exercises

Association tests [Rmd]

Association tests exercises

Exploratory data analysis [Rmd]

Plots to avoid [Rmd]

Exploratory data analysis exercises

Robust summaries [Rmd]

Rank tests [Rmd]

Robust summaries exercises

Regression: Introduction to using regression [Rmd]

Introduction to using regression exercises

Matrix notation [Rmd]

Matrix notation exercises

Matrix operations [Rmd]

Matrix operations exercises

Matrix algebra examples [Rmd]

Matrix algebra examples exercises

Linear models introduction [Rmd]

Linear models introduction exercises

Expressing design formula [Rmd]

Expressing design formula exercises

Linear models in practice [Rmd]

Linear models in practice exercises

Standard errors [Rmd]

Standard errors exercises

Interactions and contrasts [Rmd]

Interactions and contrasts exercises

Collinearity [Rmd]

Collinearity exercises

QR and regression [Rmd]

Linear models going further [Rmd]

Introduction to high-throughput data [Rmd]

Introduction to high-throughput data exercises

Inference for high-throughput data [Rmd]

Inference for high-throughput data exercises

Multiple testing [Rmd]

Multiple testing exercises

EDA for high-throughput data [Rmd]

EDA for high-throughput data exercises

Modeling [Rmd]

Modeling exercises

Bayes theorem [Rmd]

Bayes theorem exercises

Hierarchical models [Rmd]

Hierarchical models exercises

Distance [Rmd]

Distance exercises

PCA motivation [Rmd]

SVD exercises

Projections [Rmd]

Rotations [Rmd]

MDS exercises

Clustering and heatmaps [Rmd]

Clustering and heatmaps exercises

Conditional expectation [Rmd]

Conditional expectation exercises

Smoothing [Rmd]

Smoothing exercises

Machine learning [Rmd]

Crossvalidation [Rmd]

Crossvalidation exercises

Introduction to batch effects [Rmd]

Confounding [Rmd]

Confounding exercises

EDA with PCA [Rmd]

EDA with PCA exercises

Adjusting with linear models [Rmd]

Adjusting with linear models exercises

Factor analysis [Rmd]

Factor analysis exercises

Adjusting with factor analysis [Rmd]

Adjusting with factor analysis exercises

Mike Love’s general reference card

Motivations and core values (optional)

Installing Bioconductor and finding help [Rmd]

Data structure and management for genome scale experiments [Rmd]

Coordinating multiple tables: ExpressionSet

Institutional archives: GEO, ArrayExpress

Interlude: Working with general genomic features using GenomicRanges

IRanges introduced

Intra-range operations

Inter-range operations

Calculating overlaps

Range-oriented solutions for current experimental paradigms

SummarizedExperiment: for RNA-seq and 450k methylation

External storage for very large assays

GenomicFiles for families of BAM or BED

DNA Variants: VCF handling with VariantAnnotation and VariantTools

Handling multiomic archives like TCGA

Cloud-oriented solutions: e.g., Google BigQuery

Short read mapping/alignment software (optional) [Rmd]

More details on GRanges [Rmd]

Run-length encoding, views

Application to genomic landmarks

Application to 450k methylation array visualization

General overview of Bioconductor annotation [Rmd]

Levels: reference sequence, regions of interest, pathways

Discovering reference sequence

A build of the human genome

Gene/Transcript/Exon catalogs from UCSC and Ensembl

Importing and exporting regions and scores

AnnotationHub: brokering thousands of annotation resources

OrgDb: simple interface to annotation databases

Finding and managing gene sets

OrganismDb: unifying diverse annotation

Cheat sheet on Bioconductor annotation [Rmd]

Translating addresses between genome builds: liftOver [Rmd]

Distinguishing biological and technical variability [Rmd]

An experiment with pooled and individual samples

Measuring technical variation

Measuring biological variation

Interpretation

Multiple comparisons with genewise t-tests [Rmd]

Gene-wise testing

Naive enumeration of genes

Demonstrating danger of multiple testing with a set of sham comparisons

Adjusting for multiplicity with qvalue

Adjusted counts in the sham case

Moderated t tests via limma [Rmd]

A spike-in dataset

Naive t-tests

Three steps with limma: lmFit, eBayes, topTable

Exposing the spiked-in genes

A view of the shrinkage of variance estimates

Introducing gene sets and gene set analysis [Rmd]

Identifier remapping

Categorical testing

Statistical summaries for sets: Wilcoxon

Statistical summaries for sets: t statistics

A dataset for comparing expression by gender

Finding surrogate variables/batch effect correction

Data wrangling

The Broad Institute MsigDb

Adjusting for within-set correlation

A permutation procedure

A basic overview of visualization tasks and strategies [Rmd]

Gene models

Gene models plus data

Driving visualizations with functions

Using the browser to drive visualization functions via shiny

Queriable dynamic displays with plotly

Annotation-oriented visualizations

Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]

Plotting data in the context of genomic features with Gviz [Rmd]

Visualizing NGS data [Rmd]

Interactive visualization

Graphical user interfaces for multivariate data with shiny [Rmd]

Clustering gene expression data with shiny [Rmd]

Final remarks on visualization [Rmd]

Parallel computing with R and Bioconductor [Rmd]

Demonstrating simple speedup in multicore environments

Implicit parallelism with BiocParallel and GenomicAlignments

External data: data interfaces that spare RAM [Rmd]

SQLite for annotation

Tabix-indexed BAM

An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]

Benchmarking various out-of-memory solutions [Rmd]

Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]

Sharded GRanges for scalable integrative analysis [Rmd]

Basic examples of multi-omic integration [Rmd]

Transcription factor (TF) binding and gene coexpression in yeast

TF binding and GWAS hits in humans

Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]

Associating tumor stage with expression patterns

Linking DNA methylation with expression patterns

Defining a severity marker

Extracting survival times

Basic data acquisition

Working with clinical data

Working with mutations

Curation tasks for discrepant identifier formats

Working with expression data

Application to visualization: kataegis and rainfall plot [Rmd]

Overview of unit on reproducibility [Rmd]

Basic definitions

Infrastructure requirements

Statistical aspects of reproducibility

Analysis of reproducibility probability (Boos and Stefanski 2011)

Costs of highly reproducible designs

Package structure, creation, installation, management [Rmd]

create() to set up folders and DESCRIPTION

Composing documentation plus code

document(), install()

What is a package?

Using package.skeleton

Using makeOrganismPackage

Using devtools

Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

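We chose to read the web tutorial online and practice alongside the source code.

Work through it section by section. There is a lot of material, so readers can simply pick the chapters that suit them.

Every hands-on section comes with Rmd source code; download it and open it in a local RStudio.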
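For readers who prefer a terminal to RStudio, a downloaded Rmd file can also be knitted from the command line. The following is only a minimal sketch: `render_rmd` is a hypothetical helper name, and it assumes that `Rscript` is on the PATH and that the `rmarkdown` package is installed. Individual chapters may load further packages that need installing first.

```shell
#!/bin/sh
# render_rmd FILE: knit one R Markdown file to its default output format.
# render_rmd is a hypothetical helper name; it assumes Rscript is on PATH
# and that the rmarkdown package is installed in that R library.
render_rmd() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "render_rmd: file not found: $file" >&2
        return 1
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "render_rmd: Rscript not found on PATH" >&2
        return 1
    fi
    # The file path is passed as a trailing argument, not spliced into R code.
    Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$file"
}
```

Calling `render_rmd path/to/section.Rmd` writes the rendered output next to the source file, which is the default behavior of `rmarkdown::render`.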

Download all resources in bulk

On Linux, download with git or wget

# Method 1: unzipping yields the labs-master directory
wget -c
unzip master.zip
# Method 2: clone into the labs directory
git clone :genomicsclass/labs.git
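Once the download has finished, it can help to see which sections actually ship Rmd sources. The sketch below is one way to list them; `list_rmd` and `count_rmd` are hypothetical helper names, and the default directory name `labs` assumes the git clone method (use `labs-master` for the zip method).

```shell
#!/bin/sh
# list_rmd DIR: print every .Rmd file under DIR, sorted, one per line.
# list_rmd is a hypothetical helper name, not part of the course materials.
# DIR defaults to "labs", the directory assumed to come from git clone.
list_rmd() {
    dir="${1:-labs}"
    if [ ! -d "$dir" ]; then
        echo "list_rmd: directory not found: $dir" >&2
        return 1
    fi
    find "$dir" -type f -name '*.Rmd' | sort
}

# count_rmd DIR: print how many .Rmd files DIR contains.
count_rmd() {
    list_rmd "$1" | wc -l | tr -d ' '
}
```

For example, `list_rmd labs-master` prints the path of each Rmd file so that one can be picked and opened in RStudio.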
