Reference material: https://github.com/karpathy/arxiv-sanity-preserver#arxiv-sanity-preserver
This is a paper search engine. First, the project's own introduction:
arxiv sanity preserver
This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.
The gist of the introduction: this search engine is smart. To follow the latest progress in whatever field you care about, just change the topic categories in fetch_papers.py. It's a nice piece of machine-learning engineering.
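To make "changing the categories" concrete, here is a sketch of how a query URL against the arXiv API export endpoint is built. The helper name and parameters below are illustrative, not the actual code in fetch_papers.py:

```python
# Sketch of an arXiv API query URL; edit the category list to follow a
# different field. build_query() is a made-up helper for illustration.
BASE = "http://export.arxiv.org/api/query"

def build_query(categories, start=0, max_results=100):
    # arXiv API search syntax: cat:<category> terms joined with +OR+
    search = "+OR+".join("cat:" + c for c in categories)
    return "%s?search_query=%s&start=%d&max_results=%d" % (
        BASE, search, start, max_results)

print(build_query(["cs.CV", "cs.AI", "cs.CL", "cs.LG", "cs.NE", "stat.ML"]))
```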
Registration takes only a few seconds, about as fast as you can type. After signing in you'll see an interface like this:
Code layout
The code has two major parts:
Indexing code. Uses the Arxiv API to download the most recent papers in any categories you like, then downloads all of the papers, extracts all of the text, and creates tfidf vectors based on the content of each paper. This code therefore concerns the backend scraping and computation: building up a database of arxiv papers, computing content vectors, creating thumbnails, computing SVMs for people, etc.
User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching through the database and filtering papers by similarity, etc.
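The core of the indexing side (tfidf vectors over paper text, then similarity between papers) can be sketched with scikit-learn. This is a toy corpus and minimal settings, not the repo's actual analyze.py code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the extracted paper texts.
docs = [
    "deep convolutional networks for image classification",
    "recurrent networks for language modeling",
    "support vector machines for text classification",
]

# The README says analyze.py uses bigrams; other settings here are guesses.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)

# tf-idf rows are L2-normalized by default, so the dot product of two
# rows is their cosine similarity.
sims = (X @ X.T).toarray()
print(sims.shape)  # (3, 3)
```

Sorting the papers by a row of `sims` gives the "sort by similarity to any paper" feature described above.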
Dependencies
Several: You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer and training of SVMs), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil and scipy, and sqlite3 for the database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:
$ virtualenv env # optional: use virtualenv
$ source env/bin/activate # optional: use virtualenv
$ pip install -r requirements.txt
You may also need ImageMagick and pdftotext, which on Ubuntu can be installed with sudo apt-get install imagemagick poppler-utils. That is quite a few dependencies.
The processing pipeline is as follows, and the steps are best run in order:
- Run fetch_papers.py to query the arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
- Run download_pdfs.py, which iterates over all papers in the parsed pickle and downloads the papers into the folder pdf.
- Run parse_pdf_to_text.py to export all text from pdfs to files in txt.
- Run thumb_pdf.py to export thumbnails of all pdfs to thumb.
- Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves tfidf.p, tfidf_meta.p and sim_dict.p pickle files.
- Run buildsvm.py to train SVMs for all users (if any), exports a pickle user_sim.p.
- Run make_cache.py for various preprocessing so that the server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time you're starting arxiv-sanity, which initializes an empty database).
- Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here: https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
- Start the mongodb server with sudo service mongod start.
- Verify the server is running in the background: the last line of the /var/log/mongodb/mongod.log file must be [initandlisten] waiting for connections on port <port>.
- Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
Optional: you can also run twitter_daemon.py in a screen session. It uses Twitter API credentials (stored in twitter.txt) to periodically search Twitter for mentions of papers in the database, and writes the results to the pickle file twitter.p.
The author also has a simple shell script that runs these commands one after another; he runs it every day to fetch new papers, merge them into the database, and recompute all of the tfidf vectors/classifiers. More details on that process below.
protip: numpy/BLAS: The script analyze.py does a lot of heavy lifting with numpy. The author recommends carefully setting up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With 25,000 papers and 5,000 users, the script ran for several hours on his machine even with BLAS-linked numpy.
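One quick way to check which BLAS your numpy is linked against, plus a rough matmul timing (a large matrix product is fast with a real BLAS and painfully slow without one):

```python
import time

import numpy as np

# Print numpy's build configuration; look for "openblas" or "mkl" in the
# BLAS/LAPACK sections of the output.
np.show_config()

# Rough benchmark: with OpenBLAS/MKL this matmul takes a small fraction
# of a second on a typical machine.
a = np.random.rand(1000, 1000)
t0 = time.time()
b = a @ a
print("matmul took %.2fs" % (time.time() - t0))
```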
Running online
If you'd like to run the flask server online (e.g. AWS) run it as python serve.py --prod.
You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
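The README only says "random text"; one reasonable way to generate such a file (this exact recipe is my suggestion, not the repo's):

```python
import secrets

# Create the secret_key.txt that serve.py reads; token_hex(32) yields
# 64 random hex characters, plenty of entropy for a Flask secret key.
with open("secret_key.txt", "w") as f:
    f.write(secrets.token_hex(32))
```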
Current workflow
The author says the pipeline is not fully automated yet. So how does he keep the site alive? With a script that, after new papers appear on arxiv (~midnight PST), runs the following update:
python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
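The author's actual shell script is not shown in the README; a minimal wrapper that runs the same stages in order (stopping at the first failure) might look like this:

```python
import subprocess

# The daily update stages from the post, in order. Running them for real
# requires a checkout of the arxiv-sanity-preserver repo.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "parse_pdf_to_text.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]

def run_pipeline(dry_run=False):
    """Run each stage in order; check=True aborts on the first failure."""
    for script in PIPELINE:
        if dry_run:
            print("python", script)
        else:
            subprocess.run(["python", script], check=True)

run_pipeline(dry_run=True)  # dry run just prints the commands
```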
The author runs the server inside a screen session, so: screen -S serve to create it (or -r to reattach), then run:
python serve.py --prod --port 80
The server will load the new files and begin hosting the site. Note that on some systems you can't use port 80 without sudo. Two options are to use iptables to reroute the port, or to use setcap to grant the permission to the python interpreter that runs serve.py. In the latter case the author advises being careful with the permissions, and perhaps trying a virtual machine. (Blogger's note: I don't fully understand this setting; presumably the concern is security exposure of some kind.)
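For reference, the two options mentioned are commonly done like this (standard iptables/setcap usage, not commands from the README; adjust ports and paths to your setup):

```shell
# Option 1: keep serving on an unprivileged port (e.g. 5000) and redirect
# port 80 to it with an iptables NAT rule.
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 5000

# Option 2: allow the python binary to bind low ports directly. Note this
# grants the capability to EVERY script run by that interpreter, which is
# why the author suggests caution (or an isolated VM).
sudo setcap 'cap_net_bind_service=+ep' "$(readlink -f "$(which python)")"
```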
Since I haven't studied Python systematically yet, I don't dare to experiment with this part casually for now.
ImageMagick
One of the dependencies mentioned above is a versatile image toolkit (something like Meitu plus CamScanner?): http://www.imagemagick.org/script/index.php
It is free, open-source software; the current version is ImageMagick 7.0.9-2, compatible with Linux, Windows, Mac OS X, iOS, Android OS, and others.
See the ImageMagick usage examples for accomplishing tasks with ImageMagick from the command line. Also see Fred's ImageMagick Scripts, a large collection of command-line scripts for geometric transforms, blurring, sharpening, edge detection, noise reduction, and color manipulation. Alternatively, Magick.NET lets you use ImageMagick without installing the client.
For download and installation see: http://www.imagemagick.org/script/download.php
The other is pdftotext, a tool that reads a PDF and converts it to plain text. It is a polished tool built on the open-source XpdfReader code: http://www.xpdfreader.com/
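A typical invocation (the poppler-utils build of pdftotext; the filenames here are placeholders):

```shell
# Extract the text of one paper; -enc UTF-8 is a standard poppler flag.
# parse_pdf_to_text.py runs this kind of conversion over every PDF in pdf/.
pdftotext -enc UTF-8 paper.pdf paper.txt
```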