IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 游戏开发 -> 3K水稻SNP数据集的简单利用 -> 正文阅读

[游戏开发]3K水稻SNP数据集的简单利用

3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头,联合国际水稻研究所、华大基因等16家单位共同完成。该研究对来自89个国家的3,010份水稻品种进行了重测序研究(resequence),参考的是日本晴(Nipponbare)基因组。所有3,010个基因组的平均测序深度(average sequencing depth)为14×,平均基因组覆盖率(average genome coverage)和绘图率(average mapping rate)分别为94.0%和92.5%。
该项目的意义自然不用多说,并且该项目的测序数据都可以在国际水稻所CNGB上下载。我们可以从原始序列开始组装得到大的结构变异 或call到一些稀有SNP来做研究。当然以上想法不在本文讨论范围内,本文主要讨论只想知道几个基因区域上 SNP 及其分化系数等群体指标的计算。

2021.09.15 更新了单倍型分型、作图和方差分析。

放在前面

RFGB可以极大程度上满足查找某个基因上的SNP&indel和单倍型查找。并提供15个表型可以进行单倍型分析,也可以上传表型进行分析。接下来探索一下。
单倍型分析
这个页面提供单倍型分析。分别可以用基因名,染色体位置和特定的SNP来进行分析。提供MAF是最小等位基因频率,建议选一下。可以用全部的3024个品种进行查找也可以分亚群进行查找,也提供了表型关联分析。
结果

结果
这里是用的LOC_Os08g33530 CDS 区域,选的籼稻群体,共找到20个SNP,和16个单倍型。其中hap11的谷粒长最高。
这个网站用来做单倍型都是SNP,都是复等位的,就是没找到在哪下数据集。
可惜没有LD的图和pi值。
但我找到了水稻单倍型的数据,具体可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
数据地址


也可以用这个单倍型数据和表型做关联分析,用方差分析就可以。表型数据下载


以下主要是3kRG构建的SNP的探索以及简单利用。

工具准备

R
R package(LDheatmap,genetics)
plink1.9(3个操作系统版本都有)下载地址
vcftools(只有liunx版本,只是用来算群体遗传统计)下载地址

数据下载

可以在IRRI上下载到双等位SNP&indel数据(多等位的SNP貌似只能从头call,没找到下载的地方)。bed、bim、fam文件为一组,bed 是存放位点信息的二进制格式文件,用文本打开会发现是乱码;bim是存放位点信息的文件,类似于map格式;fam是存放样本信息的文件,第一列为FID,第二列为IID。
bim
fam
Nipponbare_indel 是双等位indel数据集,共有2.3m个,但是是没有经过筛选的,后续需要加工下。
双等位的SNP标记共5个。
NB_final_snp 包含了所有的SNP,共29m个,是没有经过筛选的。
Base 的SNP是经过基础筛选的,共18m个。
base_filtered_v0.7 是从Base的SNP进一步筛选的,共4.2m个,本文主要利用这个数据集。
pruned_v2.1 是2.1版本的筛选方案,包含1m个SNP。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基础上进一步筛选 共 404k 个,在用的时候发现很多基因区域上都没有SNP,毕竟个数少,比较稀疏。
各个数据集的筛选方法可以看看里边的readme文件,不同的筛选方案主要目的就是提高精度,找到因果变异。
#准备目标基因的位置
其实只需要染色体号和起始位置就好了,基因名字写啥都行,就是拿来做图片标题的。忘记提了,这个数据集中没有线粒体和叶绿体上的标记。建议起始位置可以把上游和下游都囊括进去,多个2kb,3kb 都是可以的,多找几个总是好的,万一和表型关联上了呢。基因列表,以tab分割

合并想要的SNP数据集&indel数据集

其实本可以一个个数据集来,但我想把2.3m个indel和4.2m个SNP标记合在一起分析咋办呢。最简单的办法还是用vcftools或bcftools合并两个数据集。可以参考这篇。bash代码如下:

plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先转到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一个样本 IRIS_313-8921 并且这个样本在官网给的分类为NA,空值。所以就需要删掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf

bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #构建引索
bcftools index Nipponbare_indel.vcf.gz 

bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz

bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并两个vcf文件

plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再转回bed格式 主要是这样占内存小点,这个vcf文件有55G ,bed文件只有5.23G

那么就不想用vcftools合并两个数据集咋办呢?有些时候并不想得到所有位点的信息,只需要一部分就可以了。其实可以用R和plink来实现合并两个数据集,并只提取几个位点,这种方案内存代价小点,且看后续讲解。

准备分类文件

群体分类文件如下原始文件
这个分类太细了,给它分成五类就可以了,GJ、XI、admix、aus、bas,用excel的分列就能完成,需要把IRIS_313-8921去掉,这个分类是NA。当然也可以用自定义的分类。
整合成plink能识别的分类格式,分别是FID,IID,classplink 格式

突然想起踩到的一个坑,plink1.9和2在筛选的时候无法识别样本名称带有"-"字符,所以需要把 分类文件里 “-“换成”_” ,"IRIS_313-8921"换成"IRIS_313_8921"如用 excel 就可以完成,对了fam文件里也要换。

目标区域SNP位点提取

首先需要把plink.exe和分类文件(plink_cluster.txt)放到工作目录下,方便调用,也可以把它放到环境目录下。

R代码如下:

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",	"POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###        system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

画LD heatmap

画LD用Tassel就可以话,上述代码也生成了vcf格式文件可以直接在Tassel中打开。这里用下LDheatmap。LDheatmap具体的教程比较多,自由度也比较高,可自行找些教程。当然也有在线能画LD图的,就是有点麻烦。这里具体就是如何将vcf或bed格式文件转到LDheatmap可用的格式。R代码如下:

library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}

结果

计算群体遗传性指标

首先推荐用vcftools来计算,因为方便。但理论上知道公式用excel也可以算,这里用R和plink来算Fst和pi。
计算Fst,Pi,Tajimi’D based on SNP vcf file
Fst详解(具体计算步骤)
【群体遗传学】 π (pi)的计算

Fst

需要分成至少两个群体 每个群体一个文件,可以放所有的5个分类群体,也可以放几个感兴趣的群体。vcftools 群体文件 只要一列
bash代码

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###单位点计算

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows计算 单位是bp

也可以利用plink 计算单位点的Fst。

total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

结果和vcftools是一样的

pi

用vcftools计算 可以单位点计算 也可以按windows 计算
一般认为按windows更合理,如果研究目标是SNP,还是单点算合理。
在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###单位点
vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000个snp为一个windows

双等位的基因 pi值计算较简单,参考下上面计算fst的代码就可以了

total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

pi结果比对
结果基本相同。

Tajima’s D 中性检测的指标

在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp计算

这个用R计算有点麻烦,但是用Tassel计算很方便,把vcf输入就可以了。
Tassel

单倍型分析

这里都拿到SNP数据集了,也就可以进行单倍型分析了,这里用的是Haploview,要先装java环境(但不支持1.8以上版本),有GUI比较友好。输入格式可以用plink格式(ped/map)。
但Haploview只能做SNP的分析,数量也不能太多。用前面的步骤生成了ped/info 文件,按linkage format 输入就可以了。
这里是利用了admix群体的48个SNP生成的结果。
LD

单倍型
但是我想要画单倍型网络图,这个软件不太方便,最后我选择了用beagle+dnasp+popart来做单倍型分型和画图。
beagle:用做单倍型推断和填充,需要java环境。
dnasp:用作单倍型分型,生成nex文件。
popart:用作单倍型画图,如果有地理信息,也可以加上去。
由于单倍型分析不允许有缺失,所以需要基因型填充,可以利用上面的vcf文件作为输入文件,输出为vcf.gz,解压后得到的vcf文件可以到后续利用。定相之后vcf里的"/“会变为”|"。另一个原因是3K水稻数据中并没有说明bed中的SNP是定相的,这里为了保险一点。
填充后的vcf
在powershell中输入命令

# gt 是需要的vcf文件(按基因位置分割的) out 是输出文件前缀
java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased

得到定相后的vcf文件然后生成fasta文件,由于dnasp的输入fasta文件需要等长,这里对有indel的地方加了"-"。每个样本能生成两条单链。
这里用R解决

phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
g <- "g" #输出前缀

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

fasta
dnasp
输入fasta后,Generate 选Haplotype 生成单倍型,得到单倍型文件(nex)
结果
这里一共是得到315个单倍型,但大部分都只有1个样本。
由于需要统计样本的信息,这里把nex的freq字段另存为test.txt(),用来做统计,并生成TraitLabels字段,这里需要样本的分类信息plink_cluster.txt。

freq字段

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

生成文件
把生成的文件全部内容贴到dnasp生成nex文件底部。
但我在用popart的时后,文件还是导不进去。发现需要在nex文件删掉一段。
前

后
然后就用popart可视化就好了。

315个单倍型网络图
这个太多了,手动去了下,只保留17个单倍型(修改TAXA,CHARACTERS,TRAITS三个字段)。

17个单倍型网络
也可以把分类信息改成地理信息
单倍型地理分布图
我发现popart计算出的中性突变系数和Tassel是一样的。
单倍型统计

都有单倍型信息了也顺便做下方差分析好了。
这里用RFGB下载到的GL表型试一下。
正常的关联分析应该用的是混合线性模型。
Statistical methods of gcHap-based GWAS
但这里只有一个基因上的单倍型信息,用整体的协方差或亲缘关系矩阵会把模型变复杂,个人认为还是用方差分析比较合适。(这方法有待商榷)

pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #单倍型文件
thr <- 50 #去掉样本少的单倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齐的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups 
plot(out) 

方差分析结果
简单的方差分析可得hap_101的粒长较长。

#做个总结
上述主要是提供了一个在所有7.1m的SNP数据集中,提取部分位点的标记,并计算LD和一些群体遗传指标,基本上是R+plink就解决了。
当然如果手头上有一些品种的重测序数据,也可以和3k水稻的SNP数据集放到一起进行比对,相信这个数据集还有很多信息没有被挖掘,共勉。

#R代码汇总
提取位点,群体指标计算

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)

###
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",	"POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###       system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}
###
total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

###
total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
###

vcf to fasta

phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

nex traits 文件生成

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

方差分析

pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups
plot(out) 
  游戏开发 最新文章
6、英飞凌-AURIX-TC3XX: PWM实验之使用 GT
泛型自动装箱
CubeMax添加Rtthread操作系统 组件STM32F10
python多线程编程:如何优雅地关闭线程
数据类型隐式转换导致的阻塞
WebAPi实现多文件上传,并附带参数
from origin ‘null‘ has been blocked by
UE4 蓝图调用C++函数(附带项目工程)
Unity学习笔记(一)结构体的简单理解与应用
【Memory As a Programming Concept in C a
上一篇文章      下一篇文章      查看所有文章
加:2021-09-18 10:33:04  更:2021-09-18 10:35:16 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2024年5日历 -2024/5/17 15:06:16-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码