开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 游戏开发 -> 3K水稻SNP数据集的简单利用 -> 正文阅读

[游戏开发]3K水稻SNP数据集的简单利用

3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头，联合国际水稻研究所、华大基因等16家单位共同完成。该研究对来自89个国家的3,010份水稻品种进行了重测序研究（resequence），参考的是日本晴（Nipponbare）基因组。所有3,010个基因组的平均测序深度（average sequencing depth）为14×,平均基因组覆盖率（average genome coverage）和绘图率（average mapping rate）分别为94.0％和92.5％。
该项目的意义自然不用多说，并且该项目的测序数据都可以在国际水稻所或CNGB上下载。我们可以从原始序列开始组装得到大的结构变异或call到一些稀有SNP来做研究。当然以上想法不在本文讨论范围内，本文主要讨论只想知道几个基因区域上 SNP 及其分化系数等群体指标的计算。

2021.09.15 更新了单倍型分型、作图和方差分析。

放在前面

RFGB可以极大程度上满足查找某个基因上的SNP&indel和单倍型查找。并提供15个表型可以进行单倍型分析，也可以上传表型进行分析。接下来探索一下。

这个页面提供单倍型分析。分别可以用基因名，染色体位置和特定的SNP来进行分析。提供MAF是最小等位基因频率，建议选一下。可以用全部的3024个品种进行查找也可以分亚群进行查找,也提供了表型关联分析。

这里是用的LOC_Os08g33530 CDS 区域，选的籼稻群体，共找到20个SNP，和16个单倍型。其中hap11的谷粒长最高。
这个网站用来做单倍型都是SNP，都是复等位的，就是没找到在哪下数据集。
可惜没有LD的图和pi值。
但我找到了水稻单倍型的数据，具体可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
数据地址

也可以用这个单倍型数据和表型做关联分析，用方差分析就可以。表型数据下载

以下主要是3kRG构建的SNP的探索以及简单利用。

工具准备

R
R package(LDheatmap,genetics)
plink1.9(3个操作系统版本都有)下载地址
vcftools(只有liunx版本，只是用来算群体遗传统计)下载地址

数据下载

可以在IRRI上下载到双等位SNP&indel数据（多等位的SNP貌似只能从头call，没找到下载的地方)。bed、bim、fam文件为一组，bed 是存放位点信息的二进制格式文件,用文本打开会发现是乱码；bim是存放位点信息的文件，类似于map格式；fam是存放样本信息的文件，第一列为FID,第二列为IID。
bim
fam
Nipponbare_indel 是双等位indel数据集，共有2.3m个，但是是没有经过筛选的，后续需要加工下。
双等位的SNP标记共5个。
NB_final_snp 包含了所有的SNP，共29m个，是没有经过筛选的。
Base 的SNP是经过基础筛选的，共18m个。
base_filtered_v0.7 是从Base的SNP进一步筛选的，共4.2m个，本文主要利用这个数据集。
pruned_v2.1 是2.1版本的筛选方案，包含1m个SNP。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基础上进一步筛选共 404k 个，在用的时候发现很多基因区域上都没有SNP，毕竟个数少，比较稀疏。
各个数据集的筛选方法可以看看里边的readme文件，不同的筛选方案主要目的就是提高精度，找到因果变异。
#准备目标基因的位置
其实只需要染色体号和起始位置就好了，基因名字写啥都行，就是拿来做图片标题的。忘记提了，这个数据集中没有线粒体和叶绿体上的标记。建议起始位置可以把上游和下游都囊括进去，多个2kb，3kb 都是可以的，多找几个总是好的，万一和表型关联上了呢。基因列表，以tab分割

合并想要的SNP数据集&indel数据集

其实本可以一个个数据集来，但我想把2.3m个indel和4.2m个SNP标记合在一起分析咋办呢。最简单的办法还是用vcftools或bcftools合并两个数据集。可以参考这篇。bash代码如下：

plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先转到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一个样本 IRIS_313-8921 并且这个样本在官网给的分类为NA，空值。所以就需要删掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf

bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #构建引索
bcftools index Nipponbare_indel.vcf.gz 

bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz

bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并两个vcf文件

plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再转回bed格式 主要是这样占内存小点，这个vcf文件有55G ，bed文件只有5.23G

那么就不想用vcftools合并两个数据集咋办呢？有些时候并不想得到所有位点的信息，只需要一部分就可以了。其实可以用R和plink来实现合并两个数据集，并只提取几个位点，这种方案内存代价小点，且看后续讲解。

准备分类文件

群体分类文件如下原始文件
这个分类太细了，给它分成五类就可以了，GJ、XI、admix、aus、bas，用excel的分列就能完成，需要把IRIS_313-8921去掉，这个分类是NA。当然也可以用自定义的分类。
整合成plink能识别的分类格式，分别是FID,IID,class plink 格式

突然想起踩到的一个坑，plink1.9和2在筛选的时候无法识别样本名称带有"-"字符，所以需要把分类文件里 “-“换成”_” ,"IRIS_313-8921"换成"IRIS_313_8921"如用 excel 就可以完成，对了fam文件里也要换。

目标区域SNP位点提取

首先需要把plink.exe和分类文件(plink_cluster.txt)放到工作目录下，方便调用，也可以把它放到环境目录下。

R代码如下：

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",	"POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题，还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###        system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

画LD heatmap

画LD用Tassel就可以话，上述代码也生成了vcf格式文件可以直接在Tassel中打开。这里用下LDheatmap。LDheatmap具体的教程比较多，自由度也比较高，可自行找些教程。当然也有在线能画LD图的，就是有点麻烦。这里具体就是如何将vcf或bed格式文件转到LDheatmap可用的格式。R代码如下：

library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}

计算群体遗传性指标

首先推荐用vcftools来计算，因为方便。但理论上知道公式用excel也可以算，这里用R和plink来算Fst和pi。
计算Fst,Pi,Tajimi’D based on SNP vcf file
Fst详解（具体计算步骤）
【群体遗传学】 π （pi）的计算

Fst

需要分成至少两个群体每个群体一个文件，可以放所有的5个分类群体，也可以放几个感兴趣的群体。 vcftools 群体文件只要一列
bash代码

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###单位点计算

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows计算 单位是bp

也可以利用plink 计算单位点的Fst。

total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

结果和vcftools是一样的

pi

用vcftools计算可以单位点计算也可以按windows 计算
一般认为按windows更合理，如果研究目标是SNP，还是单点算合理。
在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###单位点
vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000个snp为一个windows

双等位的基因 pi值计算较简单，参考下上面计算fst的代码就可以了

total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

pi结果比对
结果基本相同。

Tajima’s D 中性检测的指标

在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp计算

这个用R计算有点麻烦，但是用Tassel计算很方便，把vcf输入就可以了。
Tassel

单倍型分析

这里都拿到SNP数据集了，也就可以进行单倍型分析了，这里用的是Haploview，要先装java环境（但不支持1.8以上版本），有GUI比较友好。输入格式可以用plink格式(ped/map)。
但Haploview只能做SNP的分析，数量也不能太多。用前面的步骤生成了ped/info 文件，按linkage format 输入就可以了。
这里是利用了admix群体的48个SNP生成的结果。

但是我想要画单倍型网络图，这个软件不太方便，最后我选择了用beagle+dnasp+popart来做单倍型分型和画图。
beagle：用做单倍型推断和填充，需要java环境。
dnasp：用作单倍型分型，生成nex文件。
popart：用作单倍型画图，如果有地理信息，也可以加上去。
由于单倍型分析不允许有缺失，所以需要基因型填充，可以利用上面的vcf文件作为输入文件，输出为vcf.gz，解压后得到的vcf文件可以到后续利用。定相之后vcf里的"/“会变为”|"。另一个原因是3K水稻数据中并没有说明bed中的SNP是定相的，这里为了保险一点。
填充后的vcf
在powershell中输入命令

# gt 是需要的vcf文件(按基因位置分割的) out 是输出文件前缀
java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased

得到定相后的vcf文件然后生成fasta文件，由于dnasp的输入fasta文件需要等长,这里对有indel的地方加了"-"。每个样本能生成两条单链。
这里用R解决

phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
g <- "g" #输出前缀

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

fasta
dnasp
输入fasta后，Generate 选Haplotype 生成单倍型，得到单倍型文件(nex)

这里一共是得到315个单倍型，但大部分都只有1个样本。
由于需要统计样本的信息，这里把nex的freq字段另存为test.txt()，用来做统计，并生成TraitLabels字段，这里需要样本的分类信息plink_cluster.txt。

freq字段

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

生成文件
把生成的文件全部内容贴到dnasp生成nex文件底部。
但我在用popart的时后，文件还是导不进去。发现需要在nex文件删掉一段。

然后就用popart可视化就好了。

315个单倍型网络图
这个太多了，手动去了下，只保留17个单倍型(修改TAXA,CHARACTERS,TRAITS三个字段)。

17个单倍型网络
也可以把分类信息改成地理信息
单倍型地理分布图
我发现popart计算出的中性突变系数和Tassel是一样的。
单倍型统计

都有单倍型信息了也顺便做下方差分析好了。
这里用RFGB下载到的GL表型试一下。
正常的关联分析应该用的是混合线性模型。
Statistical methods of gcHap-based GWAS
但这里只有一个基因上的单倍型信息，用整体的协方差或亲缘关系矩阵会把模型变复杂，个人认为还是用方差分析比较合适。（这方法有待商榷）

pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #单倍型文件
thr <- 50 #去掉样本少的单倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齐的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups 
plot(out)

方差分析结果
简单的方差分析可得hap_101的粒长较长。

#做个总结
上述主要是提供了一个在所有7.1m的SNP数据集中，提取部分位点的标记，并计算LD和一些群体遗传指标，基本上是R+plink就解决了。
当然如果手头上有一些品种的重测序数据，也可以和3k水稻的SNP数据集放到一起进行比对，相信这个数据集还有很多信息没有被挖掘，共勉。

#R代码汇总
提取位点，群体指标计算

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)

###
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",	"POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题，还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###       system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}
###
total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

###
total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
###

vcf to fasta

phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

nex traits 文件生成

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

方差分析

pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups
plot(out)

游戏开发最新文章

6、英飞凌-AURIX-TC3XX： PWM实验之使用 GT

泛型自动装箱

CubeMax添加Rtthread操作系统组件STM32F10

python多线程编程：如何优雅地关闭线程

数据类型隐式转换导致的阻塞

WebAPi实现多文件上传，并附带参数

from origin ‘null‘ has been blocked by

UE4 蓝图调用C++函数（附带项目工程）

Unity学习笔记（一）结构体的简单理解与应用

【Memory As a Programming Concept in C a

加:2021-09-18 10:33:04 更:2021-09-18 10:35:16

360图书馆购物三丰科技阅读网日历万年历 2025年7日历

-2025/7/31 20:22:14-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码