A spider for Lianjia that collects information about second-hand housing in Shanghai.
- `Spider/`: directory for Scrapy code and data
  - `data/`: data obtained by Scrapy
    - `data_with_coordinites.csv`: data with coordinates
    - `htmls.csv`: all HTML pages
    - `original_data.csv`: data without coordinates
    - `subway.csv`: subway information
    - `urls.csv`: start URLs
    - `valid_htmls.csv`: valid HTML pages
  - `spiders/`: Scrapy code
    - `crawldata.py`: Scrapy code that obtains data from the HTML pages
    - `geturl.py`: Scrapy code that obtains the HTML pages from the start URLs
  - `items.py`: Scrapy project code, defines items
  - `middlewares.py`: Scrapy project code, defines middlewares
  - `pipelines.py`: Scrapy project code, defines pipelines
  - `settings.py`: Scrapy project code, defines settings
- `predict/`: data preprocessing and model training
  - `data/`: data after preprocessing
  - `out/`: output directory
  - `nn_pred.py`: prediction
  - `preprocess.py`: data preprocessing
  - `run.py`: model definition and training
  - `run.sh`: run script
  - `title_wordcloud.py`: makes a word cloud for titles
  - `word_embedding.py`: word embedding
- `utils/`: utility functions
  - `baidu_get_LLitude.py`: gets coordinates from Baidu Maps
  - `gaode_get_LLitude.py`: gets coordinates from Gaode Maps
  - `tencent_get_LLitude.py`: gets coordinates from Tencent Maps
  - `BeautifulSoup.py`: BeautifulSoup crawler script that obtains data from the valid HTML pages
  - `del_invalid_urls.py`: deletes invalid HTML URLs
  - `delete_used_urls.py`: deletes used HTML URLs
- `scrapy.cfg`: Scrapy project code, defines settings
- `README.md`: README file
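# Data preprocessing
python ./preprocess.py
# Word embedding
python ./word_embedding.py
# Prediction with traditional models (the last two arguments only take effect when the model is Multi-layer Perceptron)
python ./run.py --model [Model Name] --hidden_layer_sizes [hidden layer sizes] --max_iter [maximum number of iterations]
# Run all prediction models
chmod +x ./run.sh
./run.sh
# Neural network prediction
python ./nn_pred.py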
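
The `run.py` options above could be parsed roughly as in the following sketch. This is not the actual code of `run.py`: the function names `parse_cli` and `model_kwargs`, the default values, the space-separated list syntax for `--hidden_layer_sizes`, and the exact model name string are all assumptions made for illustration. It only shows how the last two arguments can be ignored unless the model is the Multi-layer Perceptron.

```python
import argparse

# Assumed spelling of the model name; the real run.py may use a different string.
MLP_NAME = "Multi-layer Perceptron"


def parse_cli(argv):
    """Parse the three documented options (defaults here are hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py options")
    parser.add_argument("--model", required=True, help="model name")
    parser.add_argument("--hidden_layer_sizes", type=int, nargs="+", default=[100],
                        help="hidden layer sizes (used by the MLP only)")
    parser.add_argument("--max_iter", type=int, default=200,
                        help="maximum number of iterations (used by the MLP only)")
    return parser.parse_args(argv)


def model_kwargs(args):
    """Return the extra keyword arguments for the chosen model.

    The two MLP-specific options are forwarded only when the MLP is selected;
    for any other model they are silently ignored.
    """
    if args.model == MLP_NAME:
        return {
            "hidden_layer_sizes": tuple(args.hidden_layer_sizes),
            "max_iter": args.max_iter,
        }
    return {}


example = parse_cli(["--model", MLP_NAME,
                     "--hidden_layer_sizes", "64", "32",
                     "--max_iter", "500"])
print(model_kwargs(example))  # {'hidden_layer_sizes': (64, 32), 'max_iter': 500}
```

With any other `--model` value, `model_kwargs` returns an empty dictionary, which matches the note that the extra arguments have no effect outside the Multi-layer Perceptron.
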
- 王鑫 - 520021910700
- 郑宇森 - 520021911173
- 江彦泽 - 520021910629