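While writing a crawler, I found that if no request headers are set, every request sent by HttpClient gets the mobile layout back, so I could not scrape the content I was after. After some digging I learned that the request headers have to be set explicitly so that the server believes the request comes from a real browser: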
HttpClient httpClient = HttpClients.createDefault(); // DefaultHttpClient is deprecated since HttpClient 4.3
// Create an HttpGet request
HttpGet httpGet = new HttpGet("xxxxx");
// Set the request headers on the HttpGet
httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
httpGet.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
httpGet.setHeader("Accept-Encoding", "gzip, deflate");
httpGet.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
httpGet.setHeader("Connection", "keep-alive");
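// Cookie and Referer are placeholders; fill them in only if the target site requires them
httpGet.setHeader("Cookie", "");
httpGet.setHeader("Referer", "");
// Host is omitted: HttpClient derives it from the request URL, and an empty value would produce an invalid request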
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
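For readers who would rather not pull in Apache HttpClient, the same idea can be sketched with the JDK's built-in java.net.http API (Java 11 or later). This is only an illustration under that assumption, not the code used in this post: the class name BrowserHeaders and the helpers desktopHeaders and browserRequest are made up for the example, and the request is built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {
    // Hypothetical helper: a few of the desktop-browser headers from above, kept in one place.
    static Map<String, String> desktopHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-cn,zh;q=0.5");
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        return headers;
    }

    // Builds (but does not send) a GET request that carries those headers.
    static HttpRequest browserRequest(String url) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        desktopHeaders().forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = browserRequest("http://example.com/");
        // Print the headers that would go out with the request.
        request.headers().map().forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Connection and Host are left out on purpose: the JDK client manages those itself and rejects attempts to set them by default.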
The most basic crawler takes only a few lines of code. Here is the crawler code; it mainly relies on HttpClient, webmagic, and jsoup:
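import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebMagicService {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://bbs.e763.com/");
        // Present a desktop browser User-Agent so the server returns the desktop layout
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        HttpResponse response = httpClient.execute(httpGet);
        // Decode the body as GBK (use "utf-8" instead for pages served as UTF-8)
        String contents = EntityUtils.toString(response.getEntity(), "gbk");
        Document document = Jsoup.parse(contents);
        // Select the links inside the hot-topics box
        Elements elements = document.select("div#hza11 div.boxtxthot a");
        // System.out.println(contents);
        for (Element element : elements) {
            System.out.println(element.text() + " : " + element.attr("href"));
        }
    }
}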
The scraped output looks like this:
三大運營商9月1日起取消漫游費 用戶無需申請自動生效 : viewthread.php?tid=907629
2年前月桂湖棄嬰案:被螞蟻咬過的孩子, 精神發育遲滯 : viewthread.php?tid=907607
.....
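The href values in the output above are relative to the forum's root URL, so they cannot be fetched as they stand. A minimal sketch of turning them into absolute URLs with the JDK's java.net.URI follows; the class name AbsoluteLinks and the helper toAbsolute are made up for illustration and are not part of the crawler above.

```java
import java.net.URI;

public class AbsoluteLinks {
    // Hypothetical helper: resolves a relative href against the page it was found on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://bbs.e763.com/";
        // prints http://bbs.e763.com/viewthread.php?tid=907629
        System.out.println(toAbsolute(page, "viewthread.php?tid=907629"));
    }
}
```

jsoup can also do this itself: if the base URL is passed as the second argument to Jsoup.parse(contents, "http://bbs.e763.com/"), then element.absUrl("href") returns the resolved link.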