The previous article covered a simple way to download images from a single web page. Real-world sites, however, usually have a multi-level structure. When second-level pages exist, Java's multithreading facilities can be used to optimize the code: create a thread pool and submit the crawl task for each link to it, which improves the program's efficiency and throughput.
Below is sample code written in Java (it uses the jsoup library to fetch and parse HTML, so the org.jsoup:jsoup dependency must be on the classpath):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
public class WebCrawler {

    private static final int MAX_DEPTH = 2;    // maximum crawl depth
    private static final int NUM_THREADS = 10; // thread pool size

    public static void main(String[] args) throws InterruptedException {
        String startUrl = "https://www.example.com"; // start URL; change it to your own address
        ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS); // create the thread pool
        List<String> urlsToCrawl = new ArrayList<>();
        urlsToCrawl.add(startUrl);
        for (int i = 0; i < MAX_DEPTH; i++) {
            // Submit one crawl task per URL; each task returns the links it found on its page
            List<Future<List<String>>> results = new ArrayList<>();
            for (String url : urlsToCrawl) {
                results.add(executor.submit(() -> {
                    // Download the images on the current page
                    downloadImageFromUrl(url);
                    // Collect all links on the current page
                    List<String> links = getLinks(url);
                    links.forEach(System.out::println);
                    return links;
                }));
            }
            // Wait for all tasks of this level to finish and merge their links (deduplicated)
            List<String> nextUrlsToCrawl = new ArrayList<>();
            for (Future<List<String>> result : results) {
                try {
                    for (String link : result.get()) {
                        if (!nextUrlsToCrawl.contains(link)) {
                            nextUrlsToCrawl.add(link);
                        }
                    }
                } catch (ExecutionException e) {
                    e.printStackTrace(); // the task for this URL failed; continue with the others
                }
            }
            urlsToCrawl = nextUrlsToCrawl;
            // Pause between levels to avoid putting pressure on the site
            TimeUnit.SECONDS.sleep(5);
        }
        executor.shutdown(); // shut down the thread pool
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS); // wait for all tasks to complete
    }
    private static void downloadImageFromUrl(String url) {
        String html = fetchHtml(url);
        if (html == null) {
            return; // the page could not be fetched
        }
        List<String> imageUrls = extractImageUrls(html, url);
        for (String imgUrl : imageUrls) {
            String fileName = "image-" + UUID.randomUUID() + ".jpg"; // adjust the file name format as needed
            downloadImage(imgUrl, fileName);
            // Pause between downloads to avoid putting pressure on the site
            try {
                TimeUnit.SECONDS.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag and stop this page
                return;
            }
        }
    }
    public static void downloadImage(String imgUrl, String fileName) {
        // Stream the image to a local file; try-with-resources closes both streams even on failure
        try (InputStream in = new URL(imgUrl).openStream();
             OutputStream out = new BufferedOutputStream(new FileOutputStream(fileName))) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    private static List<String> getLinks(String url) throws IOException {
        // Fetch the page and extract all links, using the page URL as the base for resolving relative hrefs
        String html = fetchHtml(url);
        if (html == null) {
            throw new IOException("Failed to fetch " + url);
        }
        return extractUrls(html, url);
    }
    public static String fetchHtml(String url) {
        try {
            Document document = Jsoup.connect(url).get();
            return document.html();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
    public static List<String> extractUrls(String html, String baseUrl) {
        List<String> urls = new ArrayList<>();
        // Parse with the page URL as base URI so that relative hrefs resolve to absolute URLs
        Document document = Jsoup.parse(html, baseUrl);
        Elements linkElements = document.select("a");
        for (Element linkElement : linkElements) {
            String linkUrl = linkElement.absUrl("href"); // empty string if the href cannot be resolved
            if (linkUrl.toLowerCase().startsWith("http")) {
                urls.add(linkUrl);
            }
        }
        return urls;
    }
    public static List<String> extractImageUrls(String html, String baseUrl) {
        List<String> imageUrls = new ArrayList<>();
        // Parse with the page URL as base URI so that relative src attributes resolve correctly
        Document document = Jsoup.parse(html, baseUrl);
        Elements imgElements = document.select("img");
        for (Element imgElement : imgElements) {
            String imgUrl = imgElement.absUrl("src");
            if (imgUrl.toLowerCase().startsWith("http")) {
                imageUrls.add(imgUrl);
            }
        }
        return imageUrls;
    }
}
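Note that the file name in downloadImageFromUrl always ends in .jpg, even when the image is actually a PNG or GIF; the comment there already notes that the naming scheme can be adjusted. If you want to keep the original extension, a small helper along the following lines could be dropped into the class (getExtension is an illustrative name, not part of the listing above; it simply looks at the URL path):
// Illustrative helper for preserving the original file extension (not part of the listing above)
private static String getExtension(String imgUrl) {
    String path = imgUrl.split("\\?")[0]; // drop any query string
    int dot = path.lastIndexOf('.');
    // Use the extension only if the last dot belongs to the file name; otherwise fall back to .jpg
    return (dot > path.lastIndexOf('/') && dot < path.length() - 1) ? path.substring(dot) : ".jpg";
}
// Usage: String fileName = "image-" + UUID.randomUUID() + getExtension(imgUrl);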
In the code above, we first define parameters such as the maximum crawl depth and the size of the thread pool. We then create the thread pool and a list that holds the URLs waiting to be crawled. Two nested loops walk over the URLs of the current level and submit one crawl task per URL to the thread pool; the links each task returns are merged, without duplicates, into the URL list for the next level. Finally, we shut the thread pool down and wait for all tasks to complete.
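One refinement worth considering: the deduplication above only applies within a single crawl level, so a URL that was already crawled at an earlier depth can be fetched again later. A thread-safe set of already-visited URLs, shared by all tasks, avoids that. The sketch below shows the idea (the class and method names are illustrative and not part of the code above):
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a shared "already visited" registry for the crawler above
public class VisitedUrls {
    // Thread-safe set backed by ConcurrentHashMap; safe to call from all crawl tasks
    private static final Set<String> VISITED = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a URL is seen
    public static boolean markVisited(String url) {
        return VISITED.add(url);
    }
}
Inside the crawl task one could then skip a URL with something like if (!VisitedUrls.markVisited(url)) return Collections.emptyList(); before downloading anything.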