
A Quick Start with WebMagic


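There are many Java crawlers, and WebMagic is one of them. It is well documented, easy to pick up, and works well for collecting small data sets for personal use. This article walks through its basic usage by scraping lottery draw results.

The official WebMagic documentation (Introduction · WebMagic documents) is thorough: it walks through a complete crawl with a worked example and shows how to persist the results.

WebMagic is well encapsulated. In most cases we only need to define our own PageProcessor (to extract data) and Pipeline (to handle the extracted data, for example by persisting it).

Following that pattern, let's scrape lottery draw results. The content below is for personal study only.

Requirement: scrape lottery draw results and write them to a database.

We build on Spring Boot, which makes scheduled crawling easy, and use MyBatis to write the data to the database.

------------------------------------------------------------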

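Data sources: the official Sports Lottery site (中国体彩网, the official website of the China Sports Lottery Administration Center), 500彩票 (the 500.com draw-results page), and 新浪爱彩 (the Sina Aicai draw-results page).

The Sports Lottery games are 大乐透, 7星彩, 排列3, and 排列5.

Open the official Sports Lottery site. The home page shows the latest draw result for each game.

Clicking a game opens its detail page. Why go there instead of using the home page? For a beginner, handling each game on its own is simpler and clearer.

Open the 大乐透 detail page (超级大乐透_中国体彩网), press F12 in Chrome to open the developer tools, refresh the page, and watch the requests being made.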

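Going through the requests one by one turns up one that returns JSON (the getDigitalDrawInfoV1.qry endpoint used as DLT_URL in the code below). Its data matches our needs exactly, so we can use it directly.

You may wonder why we look at the requests first instead of analyzing the page content. In fact, before finding this request I did analyze the page: its source contains no draw data, so I concluded that the data is loaded afterwards and filled into the page. That naturally points to the network requests.

JSON is ideal: after deserializing it we can use the data directly and skip extracting it from HTML. The core code is as follows.

Define the TcOrgProcessor class, which describes how to extract the data we need: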

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Assumption: the JSON library is Alibaba fastjson, whose API matches the calls below
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class TcOrgProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(TcOrgProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static final String DLT_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=85,0&isVerify=1";
    public static final String QXC_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=04,0&isVerify=1";
    public static final String PL5_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=35,0;350133,0&isVerify=1";

    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        String url = page.getUrl().toString();
        // Check the response before using it
        String text = page.getRawText();
        JSONObject rootInfo = JSON.parseObject(text);
        if (rootInfo.getIntValue("errorCode") != 0) {
            page.setSkip(true);
            logger.error("Request returned an error, URL=>{}, body=>{}", url, text);
            return;
        }
        // Read the "value" field
        JSONObject valueObject = rootInfo.getJSONObject("value");
        // 大乐透, 7星彩 and 排列 share the same result model; only the key differs
        try {
            String[] keys = new String[] { "dlt", "qxc", "plw", "pls" };
            List<DrawInfo> drawInfos = new ArrayList<>();
            for (String key : keys) {
                JSONObject drawInfoObject = valueObject.getJSONObject(key);
                if (drawInfoObject == null || drawInfoObject.isEmpty())
                    continue;
                // Extract the raw fields
                String gameName = drawInfoObject.getString("lotteryGameName");
                String drawNum = drawInfoObject.getString("lotteryDrawNum");
                String strDrawResult = drawInfoObject.getString("lotteryDrawResult").replaceAll(" ", ",");
                LocalDate drawDate = LocalDate.parse(drawInfoObject.getString("lotteryDrawTime"), DATE_FORMAT);
                String poolBalance = drawInfoObject.getString("poolBalanceAfterdraw").replaceAll(",", "");
                // Build the draw-result model
                List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
                int poolIntValue = new BigDecimal(poolBalance).intValue();
                DrawInfo drawInfo = new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.TC_ORG);
                drawInfos.add(drawInfo);
            }
            // Hand the results to the pipelines
            page.putField("results", drawInfos);
        } catch (Exception e) {
            logger.error("Failed to parse the response: {}", e.getMessage());
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new TcOrgProcessor()).addUrl(DLT_URL).addUrl(QXC_URL).addUrl(PL5_URL).addPipeline(new ConsolePipeline()).run();
    }
}
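
The two string conversions inside process are the easiest part to get wrong, so here is a minimal sketch that isolates them using only the JDK. DrawFieldParser and its method names are hypothetical and do not appear in the original code, and the sample inputs are made up. Unlike the original, which replaces spaces with commas and then splits on commas, this sketch splits on whitespace directly.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original article: isolates the two
// string conversions performed inside TcOrgProcessor.process.
class DrawFieldParser {

    // "05 12 23" -> [5, 12, 23]; the numbers are assumed to be separated by whitespace.
    static List<Integer> parseNumbers(String raw) {
        return Arrays.stream(raw.trim().split("\\s+"))
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // "188,827,520.36" -> 188827520; thousands separators are removed and the
    // fractional part is dropped by BigDecimal.intValue().
    static int parsePoolBalance(String raw) {
        return new BigDecimal(raw.replace(",", "")).intValue();
    }

    public static void main(String[] args) {
        System.out.println(parseNumbers("05 12 23 28 33 02 11")); // [5, 12, 23, 28, 33, 2, 11]
        System.out.println(parsePoolBalance("188,827,520.36"));   // 188827520
    }
}
```

One thing worth keeping in mind: BigDecimal.intValue() silently wraps around for values above Integer.MAX_VALUE (2,147,483,647), so if a pool could ever exceed that amount, a long field would be the safer choice.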

 

DrawInfo is the draw-result model. We normalize the extracted data into this model so that it is easy to use later in the Pipeline.

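import java.time.LocalDate;
import java.util.List;
import java.util.Objects;

public class DrawInfo {
    // Game name, e.g. 大乐透, 7星彩, 排列5
    private String game;
    // Draw number, e.g. 21035
    private String expect;
    // Draw date, e.g. 2021-03-15
    private LocalDate drawDate;
    // Winning numbers, e.g. [1, 2, 3, 4, 5]
    private List<Integer> drawResult;
    // Prize pool balance, e.g. 188827520
    private int poolBalance;
    // Source the record was collected from
    private Source source;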

    public String getGame() {
        return game;
    }

    public void setGame(String game) {
        this.game = game;
    }

    public String getExpect() {
        return expect;
    }

    public void setExpect(String expect) {
        this.expect = expect;
    }

    public LocalDate getDrawDate() {
        return drawDate;
    }

    public void setDrawDate(LocalDate drawDate) {
        this.drawDate = drawDate;
    }

    public List<Integer> getDrawResult() {
        return drawResult;
    }

    public void setDrawResult(List<Integer> drawResult) {
        this.drawResult = drawResult;
    }

    public int getPoolBalance() {
        return poolBalance;
    }

    public void setPoolBalance(int poolBalance) {
        this.poolBalance = poolBalance;
    }

    public Source getSource() {
        return source;
    }

    public void setSource(Source source) {
        this.source = source;
    }

    public DrawInfo(String game, String expect, LocalDate drawDate, List<Integer> drawResult, int poolBalance, Source source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.poolBalance = poolBalance;
        this.source = source;
    }

    @Override
    public String toString() {
        return "DrawInfo{" + "game='" + game + '\'' + ", expect='" + expect + '\'' + ", drawDate=" + drawDate + ", drawResult=" + drawResult
                + ", poolBalance=" + poolBalance + ", source=" + source + '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (o == null || getClass() != o.getClass())
            return false;
        DrawInfo drawInfo = (DrawInfo) o;
        return Objects.equals(game, drawInfo.game) && Objects.equals(expect, drawInfo.expect) && Objects.equals(drawDate, drawInfo.drawDate)
                && Objects.equals(drawResult, drawInfo.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }
}
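
Note that equals and hashCode above compare only game, expect, drawDate and drawResult, leaving out poolBalance and source. A likely consequence is that the same draw collected from two sources counts as one record. The sketch below illustrates this with MiniDraw, a hypothetical trimmed stand-in for DrawInfo in which source is a plain String; the sample values are made up.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical, trimmed stand-in for DrawInfo. It keeps the same equality
// fields (game, expect, drawDate, drawResult); source is a plain String here.
class MiniDraw {
    final String game;
    final String expect;
    final LocalDate drawDate;
    final List<Integer> drawResult;
    final int poolBalance; // not part of equals/hashCode
    final String source;   // not part of equals/hashCode

    MiniDraw(String game, String expect, LocalDate drawDate, List<Integer> drawResult, int poolBalance, String source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.poolBalance = poolBalance;
        this.source = source;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (o == null || getClass() != o.getClass())
            return false;
        MiniDraw other = (MiniDraw) o;
        return Objects.equals(game, other.game) && Objects.equals(expect, other.expect)
                && Objects.equals(drawDate, other.drawDate) && Objects.equals(drawResult, other.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }

    // Keeps the first occurrence of each distinct draw, in encounter order.
    static Set<MiniDraw> dedupe(List<MiniDraw> draws) {
        return new LinkedHashSet<>(draws);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2021, 3, 15);
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        MiniDraw official = new MiniDraw("plw", "21035", date, numbers, 188827520, "TC_ORG");
        MiniDraw mirror = new MiniDraw("plw", "21035", date, numbers, 0, "WUBAI_COM");
        System.out.println(official.equals(mirror));                        // true
        System.out.println(dedupe(Arrays.asList(official, mirror)).size()); // 1
    }
}
```

With this definition of equality, a Pipeline fed by several sources can drop duplicates with a Set before writing anything to the database.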

We now have the data we need. A custom Pipeline lets us process the crawled results ourselves.

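import java.util.List;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class DrawResultPipeline implements Pipeline {
    private final Logger logger = LoggerFactory.getLogger(DrawResultPipeline.class);

    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        Map<String, Object> map = resultItems.getAll();
        logger.info("Crawl results: {}", map);
        // "results" is the key the PageProcessor used in page.putField
        //noinspection unchecked
        List<DrawInfo> results = (List<DrawInfo>) map.get("results");
        //TODO: persist to the database
    }
}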

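To pick up new data promptly, we set up a scheduled task that crawls at a fixed interval.

Scheduled tasks are easy to set up in Spring Boot (search for "Spring Boot scheduled tasks" for details). Scheduling has to be switched on with @EnableScheduling on a configuration class before @Scheduled methods will run.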

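import javax.annotation.Resource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

@Component
public class SchedulerTask {
    private final Logger logger = LoggerFactory.getLogger(SchedulerTask.class);
    // Inject the custom Pipeline and pass it to WebMagic's Spider
    @Resource private DrawResultPipeline drawResultPipeline;

    // Every 2 minutes between 08:00 and 23:58
    @Scheduled(cron = "0 0/2 8-23 * * ?")
    public void fetch() throws Exception {
        // The spider needs start URLs, otherwise it has nothing to crawl
        Spider.create(new TcOrgProcessor()).addUrl(TcOrgProcessor.DLT_URL, TcOrgProcessor.QXC_URL, TcOrgProcessor.PL5_URL)
                .setExitWhenComplete(true).addPipeline(drawResultPipeline).start();
        //TODO: add spiders for the other sources
    }
}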
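
To make the schedule concrete: Spring cron expressions have six fields (second, minute, hour, day of month, month, day of week), so "0 0/2 8-23 * * ?" means second 0 of every second minute during hours 8 to 23. The sketch below hard-codes that meaning with java.time to count how often the task fires. CronWindow is a hypothetical name, and this is an illustration of this one expression only, not how Spring parses cron syntax.

```java
import java.time.LocalTime;

// Hypothetical illustration only: hard-codes the meaning of
// "0 0/2 8-23 * * ?" instead of parsing cron syntax the way Spring does.
class CronWindow {

    // True when the given time of day is a firing time of the expression.
    static boolean fires(LocalTime t) {
        return t.getSecond() == 0
                && t.getMinute() % 2 == 0
                && t.getHour() >= 8 && t.getHour() <= 23;
    }

    // Counts the firing times in one day by checking every second.
    static int firesPerDay() {
        int count = 0;
        for (int s = 0; s < 24 * 60 * 60; s++) {
            if (fires(LocalTime.ofSecondOfDay(s))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fires(LocalTime.of(8, 0, 0)));   // true
        System.out.println(fires(LocalTime.of(23, 58, 0))); // true
        System.out.println(fires(LocalTime.of(7, 58, 0)));  // false
        System.out.println(firesPerDay());                  // 480
    }
}
```

That is 16 hours times 30 runs per hour, or 480 crawls a day, so whatever the Pipeline does with the results should tolerate seeing the same draw many times.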

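That completes the crawl-and-persist flow.

The other sources differ only in their PageProcessor; persistence works the same way. So we only need to write the corresponding PageProcessor and, once it is done, add it to the scheduled task to have it crawled on schedule.

The PageProcessor for 500彩票: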

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

public class WubaiProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(WubaiProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    public static final String START_URL = "http://kaijiang.500.com";

    // Row id in the results table -> game name stored in DrawInfo
    private static final String[][] GAMES = {
            { "dlt", "大乐透" }, { "qxc", "7星彩" }, { "plw", "排列5" }, { "pls", "排列3" } };

    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        // Root node of the draw-results table
        Selectable rootNode = page.getHtml().xpath("//table[@class='kj_tablelist01']/tbody");
        List<DrawInfo> drawInfos = new ArrayList<>();
        for (String[] game : GAMES) {
            // Each game is parsed in its own try block so one bad row does not stop the others
            try {
                drawInfos.add(parseRow(rootNode, game[0], game[1]));
            } catch (Exception e) {
                logger.error("Failed to parse the {} row: {}", game[1], e.getMessage());
            }
        }
        page.putField("results", drawInfos);
    }

    // Parses the table row whose id is the given game id (dlt, qxc, plw or pls)
    private DrawInfo parseRow(Selectable rootNode, String id, String gameName) {
        Selectable row = rootNode.xpath("//tr[@id='" + id + "']");
        String drawNum = row.xpath("//td[2]/text()").replace("期", "").toString().trim();
        String strDrawDate = row.xpath("//td[3]/text()").toString().trim();
        LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
        // Numbers and pool balance are arguments of inline script calls, so extract them with a regex
        String strDrawResult = row.xpath("//td[4]/script").regex("formatResult\\('" + id + "','(.*)'\\)", 1).toString().trim();
        if ("dlt".equals(id)) {
            // As in the original code, only 大乐透 has its "|" separator replaced
            strDrawResult = strDrawResult.replace("|", ",");
        }
        String poolBalance = row.xpath("//td[5]/script").regex("formatCCMoney\\('" + id + "','(.*)'\\)", 1).toString().trim();
        logger.info("{}: {}, {}, {}, {}", gameName, drawNum, drawDate, strDrawResult, poolBalance);
        // Build the draw-result model
        List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
        int poolIntValue = new BigDecimal(poolBalance).intValue();
        return new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new WubaiProcessor()).addUrl(START_URL).run();
    }
}
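
The regular expressions are the fragile part of this processor, because the parentheses of the script call must be escaped, and inside a Java string literal that means a doubled backslash. Here is a minimal sketch of the same extraction using only java.util.regex, which can be tried without fetching the page. ScriptArgExtractor is a hypothetical name and the sample script text is made up to resemble the calls the processor looks for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article: extracts the second
// argument of an inline script call such as formatResult('dlt','...').
class ScriptArgExtractor {

    // Returns the second argument of function('id','...'), or null when the
    // call does not occur in the script text.
    static String secondArg(String script, String function, String id) {
        // \\( and \\) match literal parentheses; (.*) captures the argument
        Pattern pattern = Pattern.compile(
                Pattern.quote(function) + "\\('" + Pattern.quote(id) + "','(.*)'\\)");
        Matcher matcher = pattern.matcher(script);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample resembling the script inside the fourth table cell
        String script = "<script>formatResult('dlt','05,12,23,28,33|02,11');</script>";
        System.out.println(secondArg(script, "formatResult", "dlt")); // 05,12,23,28,33|02,11
        System.out.println(secondArg(script, "formatResult", "qxc")); // null
    }
}
```

Because (.*) is greedy, the capture runs up to the last ') in the text, which is fine as long as each script element contains a single call.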

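That completes the example: we used WebMagic to collect the data we wanted and handed it to a Pipeline, where it can be persisted to the database.

Questions and discussion are welcome.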
