Since I rarely log in to the university's OA portal, yet many notices (reimbursement notices, for example) are only published there, I wrote a scraper for it (which also doubled as my student innovation project).
I plan to revise it later to add a new-message prompt, but that will have to wait until after the break.
```python
# coding: utf-8
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

searchId = input("Enter the searchId to query: ")
base_url = ("https://oa.jlu.edu.cn/defaultroot/PortalInformation!jldxList.action"
            "?searchId=%s&startPage=" % searchId)

# Fetch the list once without a page number to read the total record count
# out of the pager text ("第 x/y 页 共 n 条记录").
first_page = requests.get(
    "https://oa.jlu.edu.cn/defaultroot/"
    "PortalInformation!jldxList.action?searchId=%s" % searchId
).text
num = int(re.findall(
    r'第 <font color="red">.*?</font>/.*? 页 共 (.*?) 条记录 ',
    first_page, re.S)[0])
print("%d records in total" % num)

pages = (num + 29) // 30  # 30 notices per page, rounded up

name, link, date, announcer = [], [], [], []
for page in range(1, pages + 1):
    html = requests.get(base_url + str(page)).text
    soup = BeautifulSoup(html, 'html.parser')
    notice_divs = soup.find_all('div', {'class': 'li rel'})
    name += [div.find('a', {'class': 'font14'}).string for div in notice_divs]
    link += [div.find('a', {'class': 'font14'}).get('href') for div in notice_divs]
    date += [div.find('span', {'class': 'time'}).string for div in notice_divs]
    announcer += [div.find('a', {'class': 'column'}).string for div in notice_divs]

# The hrefs on the page are relative; prefix the site root to get full URLs.
link = ["https://oa.jlu.edu.cn/defaultroot/" + href for href in link]

df = pd.DataFrame({'Name': name, 'URL': link, 'Date': date, 'Announcer': announcer})
print(df.head())
# utf-8-sig lets Excel open the Chinese text correctly.
df.to_csv('%s_notices.csv' % searchId, index=False, encoding='utf-8-sig')
```
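The new-message prompt planned for later could work by diffing a fresh scrape against the CSV saved on the previous run. A minimal sketch, assuming the `Name` column that the script writes; the function name and first-run behavior are my own choices, not part of the original:

```python
import pandas as pd

def find_new_notices(latest_names, csv_path):
    """Return titles present in a fresh scrape but absent from the CSV
    written by a previous run (hypothetical helper)."""
    try:
        seen = set(pd.read_csv(csv_path)['Name'])
    except FileNotFoundError:
        seen = set()  # first run: no saved CSV, so everything counts as new
    return [title for title in latest_names if title not in seen]
```

On each run you would scrape as above, call `find_new_notices(name, csv_path)`, print any hits, then overwrite the CSV.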
Note: be sure to access the site from the campus network; otherwise the connection may fail.
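Since off-campus requests tend to hang or fail silently, a small wrapper with a timeout and a couple of retries gives a clearer error than a bare `requests.get`. This helper and its retry counts are a sketch of my own, not part of the original script:

```python
import requests

def fetch(url, timeout=10, retries=3):
    """GET a page with a timeout, retrying a few times; on repeated
    failure, raise with a hint about the campus-network requirement."""
    last_exc = None
    for _ in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            last_exc = exc
    raise ConnectionError(
        "Could not reach %s - are you on the campus network?" % url
    ) from last_exc
```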