A Scraper for Notices on Jilin University's Electronic OA Platform

Since I rarely visit the university's OA platform, yet many notices are posted there (reimbursement announcements, for example), I wrote a scraper (and knocked out my undergraduate innovation project along the way).

I plan to revise it later to add alerts for new notices, though that will have to wait until the break; a rough sketch of the idea appears after the script below.

 

# -*- coding: utf-8 -*-
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# The searchId identifies which notice board (category) to scrape.
searchId = input("Enter the searchId you want to query: ")
base_url = ("https://oa.jlu.edu.cn/defaultroot/PortalInformation!jldxList.action"
            "?searchId=%s&startPage=" % searchId)

names, links, dates, announcers = [], [], [], []

# Fetch the first page once to read the total record count out of the pager text.
# The pattern matches the site's Chinese pagination markup, so it must stay as-is.
judge_text = requests.get(
    "https://oa.jlu.edu.cn/defaultroot/PortalInformation!jldxList.action?searchId=%s" % searchId
).text
num = re.findall(
    r'第&nbsp;<font color="red">.*?</font>/.*?&nbsp;页 共&nbsp;(.*?)&nbsp;条记录&nbsp;&nbsp;',
    judge_text, re.S
)[0]
print("%s records in total" % num)

# 30 notices per page; round up, so an exact multiple of 30 doesn't get a spare page.
pages = -(-int(num) // 30)

for page in range(1, pages + 1):
    response = requests.get(base_url + str(page))
    soup = BeautifulSoup(response.text, 'html.parser')
    notice_divs = soup.find_all('div', {'class': 'li rel'})
    names += [div.find('a', {'class': 'font14'}).string for div in notice_divs]
    links += [div.find('a', {'class': 'font14'}).get('href') for div in notice_divs]
    dates += [div.find('span', {'class': 'time'}).string for div in notice_divs]
    announcers += [div.find('a', {'class': 'column'}).string for div in notice_divs]

# The hrefs on the list page are relative; prepend the site root.
links = ["https://oa.jlu.edu.cn/defaultroot/" + href for href in links]

df = pd.DataFrame({'Name': names, 'URL': links, 'Date': dates, 'Announcer': announcers})
print(df.head())
df.to_csv('%s_notices.csv' % searchId, index=False)  # index=False drops the numeric row index
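
Once this runs, the saved CSV doubles as a baseline for the new-notice alert mentioned above. A minimal sketch of that idea, assuming the CSV written by the script already exists (the check_new_notices helper and its parameters are my own illustration, not part of the script):

# Sketch of new-notice detection: re-fetch page 1 and report rows whose URLs
# are not yet in the saved CSV. Helper name and parameters are assumptions.
import pandas as pd
import requests
from bs4 import BeautifulSoup

def check_new_notices(search_id, csv_path):
    seen = set(pd.read_csv(csv_path)['URL'])  # URLs already recorded
    url = ("https://oa.jlu.edu.cn/defaultroot/PortalInformation!jldxList.action"
           "?searchId=%s&startPage=1" % search_id)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for div in soup.find_all('div', {'class': 'li rel'}):
        a = div.find('a', {'class': 'font14'})
        full = "https://oa.jlu.edu.cn/defaultroot/" + a.get('href')
        if full not in seen:
            print("New notice:", a.string, full)

Running it periodically (from cron, say) and printing only unseen URLs would be enough for a rough alert.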

 

Note: be sure to access the site from the campus network, otherwise the connection may fail.
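
Off campus, the requests will typically just hang until the TCP timeout. A small sketch that fails fast with a clear message instead (the fetch helper and the 10-second timeout are assumptions, not part of the script above):

# Sketch: wrap requests.get so off-campus runs fail fast with a clear message.
import requests

def fetch(url, timeout=10):
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP errors into exceptions too
        return response.text
    except requests.RequestException as exc:
        raise SystemExit("Could not reach oa.jlu.edu.cn (campus network required?): %s" % exc)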

 
