- Add a delay between page fetches to reduce the chance of being blocked.
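ThreadUtil.sleep((long) (Math.random() * Integer.parseInt(LoadPropertyUtil.getConfig("millions_3"))));
- Buy a proxy IP pool and fetch pages through randomly chosen IPs. Store the IPs in Redis, attach a randomly picked one to each request, and delete it from Redis once it stops working.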
// Download a page
public static String getPageContent(String url) {
    // 1. Create the HttpClient builder
    HttpClientBuilder builder = HttpClients.custom();
    /****************** Dynamic proxy IP ***********************/
    // Entries have the form ip:port, e.g. 182.90.28.52:80
    RedisUtil redisUtil = new RedisUtil();
    // Fetch a proxy from the Redis set
    String ip_port = redisUtil.getSet("proxy");
    if (StringUtils.isNotBlank(ip_port)) {
        String[] arr = ip_port.split(":");
        String proxy_ip = arr[0];
        int proxy_port = Integer.parseInt(arr[1]);
        // Route requests through the proxy
        HttpHost proxy = new HttpHost(proxy_ip, proxy_port);
        builder.setProxy(proxy);
    }
    // 2. Create the HttpGet request for the URL
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", USER_AGENT);
    String content = null;
    // Build the client whether or not a proxy was found; close client and response when done
    try (CloseableHttpClient client = builder.build();
         CloseableHttpResponse response = client.execute(request)) {
        // The entity carries the response body
        HttpEntity entity = response.getEntity();
        // Convert the HTML page to a String
        content = EntityUtils.toString(entity);
    } catch (HttpHostConnectException e) {
        e.printStackTrace();
        // The current proxy is unreachable, so remove it from the proxy set
        if (StringUtils.isNotBlank(ip_port)) {
            redisUtil.deleteSet("proxy", ip_port);
        }
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return content;
}
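The pick-and-evict behaviour that the method above delegates to Redis can be tried out without a Redis server. The following is only a minimal in-memory sketch: `ProxyPool`, `pick`, `evict`, and `isWellFormed` are hypothetical names that do not appear in the original code, and the list merely stands in for the Redis `proxy` set.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical in-memory stand-in for the Redis "proxy" set. */
class ProxyPool {
    private final List<String> proxies = new ArrayList<>();

    ProxyPool(Collection<String> initial) {
        proxies.addAll(initial);
    }

    /** Returns a random "ip:port" entry, or null when the pool is empty. */
    synchronized String pick() {
        if (proxies.isEmpty()) {
            return null;
        }
        return proxies.get(ThreadLocalRandom.current().nextInt(proxies.size()));
    }

    /** Drops a proxy that failed to connect so it is not picked again. */
    synchronized void evict(String ipPort) {
        proxies.remove(ipPort);
    }

    synchronized int size() {
        return proxies.size();
    }

    /** Checks that an entry has the form host:port with a port in 1..65535. */
    static boolean isWellFormed(String ipPort) {
        if (ipPort == null) {
            return false;
        }
        String[] parts = ipPort.split(":");
        if (parts.length != 2 || parts[0].isEmpty()) {
            return false;
        }
        try {
            int port = Integer.parseInt(parts[1]);
            return port > 0 && port <= 65535;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Validating an entry with something like `isWellFormed` before splitting it would also keep a malformed record in the pool from causing an `ArrayIndexOutOfBoundsException` or `NumberFormatException` during the download.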
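- Deploy several crawler instances so that no single node hits a site too often. Within a node, poll or randomly pick among different sites. The class below picks randomly among video sites: each URL is stored in Redis under its top-level domain, and every fetch takes a URL from a randomly chosen domain.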
/**
 * Redis-backed URL repository: takes URLs from randomly chosen video sites
 * so that no single site is hit too often
 * @author 小新
 * @create 2021/12/3- 11:37
 */
public class RandomRedisRepositoryService implements IRepositoryService {
    // top-level domain -> Redis key; serves as an index
    HashMap<String, String> hashMap = new HashMap<>();
    RedisUtil redisUtil = new RedisUtil();
    Random random = new Random();

    /**
     * Take a URL from a randomly chosen site
     * @return the URL, or null if no site has been registered yet
     */
    @Override
    public String poll() {
        if (hashMap.isEmpty()) {
            // random.nextInt(0) would throw IllegalArgumentException
            return null;
        }
        String[] keyArr = hashMap.keySet().toArray(new String[0]);
        int nextInt = random.nextInt(keyArr.length);
        String key = keyArr[nextInt];
        String value = hashMap.get(key);
        return redisUtil.poll(value);
    }

    @Override
    public void addHighLevel(String url) {
        // Extract the top-level domain
        String topDomain = MatchUtil.getTopDomain(url);
        // Look up the Redis key for this domain, registering it on first use
        String redisKey = hashMap.get(topDomain);
        if (redisKey == null) {
            redisKey = topDomain;
            hashMap.put(topDomain, redisKey);
        }
        redisUtil.add(redisKey, url);
    }

    @Override
    public void addLowLevel(String url) {
        addHighLevel(url);
    }
}
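The class above depends on `MatchUtil.getTopDomain` and `RedisUtil`, neither of which is shown. To experiment with the "one queue per domain, poll from a random domain" idea on its own, a rough in-memory sketch could look like the following. `DomainQueues` and `topDomain` are hypothetical names, and the domain extraction is a deliberately naive assumption (it keeps the last two host labels, so it would mis-handle suffixes such as `co.uk`); it is not the author's `MatchUtil`.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** In-memory sketch: one URL queue per domain, polled from a randomly chosen domain. */
class DomainQueues {
    // Only non-empty queues are kept in the map.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private final Random random;

    DomainQueues(Random random) {
        this.random = random;
    }

    /** Naive stand-in for MatchUtil.getTopDomain: keeps the last two host labels. */
    static String topDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in " + url);
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Queues the URL under its domain, creating the queue on first use. */
    void add(String url) {
        queues.computeIfAbsent(topDomain(url), k -> new ArrayDeque<>()).addLast(url);
    }

    /** Takes a URL from a random domain; returns null when nothing is queued. */
    String poll() {
        if (queues.isEmpty()) {
            return null;
        }
        List<String> keys = new ArrayList<>(queues.keySet());
        String key = keys.get(random.nextInt(keys.size()));
        Deque<String> queue = queues.get(key);
        String url = queue.pollFirst();
        if (queue.isEmpty()) {
            queues.remove(key); // drop drained domains so they are not chosen again
        }
        return url;
    }
}
```

Unlike the Redis-backed class, this sketch removes a domain once its queue is drained, so `poll` never lands on an empty domain while others still hold URLs.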
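Note: advantages of using Redis as the URL repository
1. If a crawler machine goes down, the data already stored in Redis is kept.
2. As a shared repository, it can be read and written by multiple nodes at the same time, which balances the load.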