Packages installed under Python 3.6
$ pip list
Package Version
---------------------- -------------
aniso8601 2.0.0
asn1crypto 0.23.0
astroid 1.6.2
attrs 17.2.0
Automat 0.6.0
awscli 1.14.14
bcrypt 3.1.4
beautifulsoup4 4.6.0
bleach 1.5.0
boto 2.48.0
boto3 1.5.8
botocore 1.8.22
bs4 0.0.1
bz2file 0.98
certifi 2017.7.27.1
cffi 1.11.0
chardet 3.0.4
click 6.7
colorama 0.3.9
constantly 15.1.0
coreapi 2.3.3
coreschema 0.0.4
cryptography 2.0.3
cssselect 1.0.1
cycler 0.10.0
cymem 1.31.2
cypari 2.2.0
Cython 0.28.2
cytoolz 0.8.2
de-core-news-sm 2.0.0
decorator 4.1.2
dill 0.2.7.1
Django 1.11.5
django-redis 4.8.0
django-rest-swagger 2.1.2
djangorestframework 3.7.3
docutils 0.14
dpath 1.4.2
en-blade-model-sm 2.0.0
en-core-web-lg 2.0.0
en-core-web-md 2.0.0
en-core-web-sm 2.0.0
entrypoints 0.2.3
es-core-news-sm 2.0.0
fabric 2.0.1
Fabric3 1.14.post1
fasttext 0.8.3
flasgger 0.8.3
Flask 1.0.2
Flask-RESTful 0.3.6
flask-swagger 0.2.13
fr-core-news-md 2.0.0
fr-core-news-sm 2.0.0
ftfy 4.4.3
future 0.16.0
FXrays 1.3.3
gensim 3.0.0
h5py 2.7.1
html5lib 0.9999999
hyperlink 17.3.1
idna 2.6
incremental 17.5.0
invoke 1.0.0
ipykernel 4.6.1
ipython 6.2.0
ipython-genutils 0.2.0
ipywidgets 7.0.1
isort 4.3.4
it-core-news-sm 2.0.0
itsdangerous 0.24
itypes 1.1.0
jedi 0.11.0
jieba 0.39
Jinja2 2.10
jmespath 0.9.3
joblib 0.11
jsonpath 0.75
jsonschema 2.6.0
jupyter 1.0.0
jupyter-client 5.1.0
jupyter-console 5.2.0
jupyter-core 4.3.0
Keras 2.0.8
keyring 12.2.1
lazy-object-proxy 1.3.1
lxml 4.0.0
Markdown 2.6.9
MarkupSafe 1.0
matplotlib 2.0.2
mccabe 0.6.1
mistune 0.8.3
msgpack-numpy 0.4.1
msgpack-python 0.5.2
murmurhash 0.28.0
nbconvert 5.3.1
nbformat 4.4.0
ner 0.1
networkx 2.1
nltk 3.2.5
notebook 5.1.0
numpy 1.14.0rc1+mkl
olefile 0.44
openapi-codec 1.3.2
pandas 0.23.0
pandocfilters 1.4.2
paramiko 2.4.1
parsel 1.2.0
parso 0.1.0
pathlib 1.0.1
pickleshare 0.7.4
Pillow 4.2.1
pip 10.0.1
plac 0.9.6
plink 2.2
preshed 1.0.0
prompt-toolkit 1.0.15
protobuf 3.4.0
psutil 5.4.3
py4j 0.10.6
pyasn1 0.3.6
pyasn1-modules 0.1.4
pycparser 2.18
PyDispatcher 2.0.5
Pygments 2.2.0
pylint 1.8.3
pymssql 2.1.3
PyMySQL 0.7.11
PyNaCl 1.2.1
pyOpenSSL 17.3.0
pyparsing 2.2.0
pypng 0.0.18
pyreadline 2.1
pyspark 2.3.0
python-dateutil 2.6.1
python-snappy 0.5.2
pytz 2017.2
pywin32-ctypes 0.1.2
PyYAML 3.12
pyzmq 16.0.2
qtconsole 4.3.1
queuelib 1.4.2
redis 2.10.6
regex 2017.4.5
requests 2.18.4
rsa 3.4.2
s3transfer 0.1.12
scikit-learn 0.19.1
scikit-surprise 1.0.6
scipy 1.0.0
Scrapy 1.4.0
seaborn 0.8.1
service-identity 17.0.0
setuptools 28.8.0
sh 1.12.14
simplegeneric 0.8.1
simplejson 3.13.2
six 1.11.0
smart-open 1.5.3
snappy 2.6
snappy-manifolds 1.0
spacy 2.0.7
spherogram 1.8
stanfordcorenlp 3.8.0.1
swagger-py-codegen 0.2.9
tensorflow 1.3.0
tensorflow-tensorboard 0.1.8
termcolor 1.1.0
testpath 0.3.1
Theano 0.9.0
thinc 6.10.2
toolz 0.9.0
tornado 4.5.2
tqdm 4.19.5
traitlets 4.3.2
tushare 1.1.6
Twisted 17.9.0
ujson 1.35
uritemplate 3.0.0
urllib3 1.22
virtualenv 15.1.0
w3lib 1.18.0
wcwidth 0.1.7
webencodings 0.5.1
Werkzeug 0.14.1
wheel 0.30.0
widgetsnbextension 3.0.3
wordcloud 1.4.1
wrapt 1.10.11
xgboost 0.71
zope.interface 4.4.3
Download site for Python Windows binaries
Unofficial Windows Binaries for Python Extension Packages
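If pip install fails for a Python library on Windows, the cause may be that one of the packages it depends on failed to install.
In that case, try downloading the package that failed from this site, install it directly from the local file, and then retry the installation of the main package.
Python code sample site
Commonly used packages
1. Data science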
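NumPy
NumPy provides two basic objects: ndarray and ufunc. An ndarray is a multidimensional array that stores a single data type, and a ufunc is a function that operates on arrays. NumPy's features:
- N-dimensional arrays: a fast, memory-efficient multidimensional array that provides vectorized mathematical operations.
- Standard mathematical operations can be applied to all the data in an array without writing loops.
- Data is easy to pass to external libraries written in lower-level languages (C/C++), and those libraries can just as easily return data as NumPy arrays.
NumPy does not itself provide high-level data analysis functions, but a solid understanding of NumPy arrays and array-oriented computing is the foundation for using the higher-level tools effectively.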
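The points above can be illustrated with a short sketch. The array values and variable names are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

# A 2-D ndarray; every element shares one dtype (float64 here).
prices = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Vectorized arithmetic: the whole array is scaled without a Python loop.
discounted = prices * 0.9

# np.sqrt is a ufunc, so it is applied element by element.
roots = np.sqrt(prices)

# Reductions along an axis also avoid explicit loops.
column_totals = prices.sum(axis=0)

print(prices.dtype, discounted.shape)
print(column_totals)
```

Writing the same operations with nested Python loops would give the same numbers, but the vectorized form is shorter and usually much faster.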
Pandas
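Pandas is a data analysis package for Python. It was originally developed as a tool for financial data analysis, so it has good support for time series analysis.
Pandas was created for data analysis tasks. It builds on a number of libraries and standard data models and provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently. It includes high-level data structures and tools that make data analysis fast and simple. It is built on top of NumPy and makes NumPy easier to use.
- Data structures with labeled axes that support automatic or explicit data alignment. This prevents common errors caused by misaligned data and by working with differently indexed data from different sources.
- Missing data is easier to handle with Pandas.
- Merge and join operations in the style of popular databases (for example, SQL-based databases).
Pandas is one of the best tools for data cleaning and wrangling.
Pandas is the data-processing foundation for applying machine learning packages.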
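A minimal sketch of the label alignment and missing-data handling described above. The two Series and their index labels are invented for illustration.

```python
import pandas as pd

# Two Series whose indexes only partly overlap.
q1 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
q2 = pd.Series([1.0, 2.0, 3.0], index=["b", "c", "d"])

# Addition aligns on index labels; a label missing on either side yields NaN.
total = q1 + q2

# Missing values can then be filled or dropped explicitly.
filled = total.fillna(0.0)
dropped = total.dropna()

print(total)
print(filled)
print(dropped)
```

Because alignment happens on labels rather than positions, "b" is added to "b" even though the two values sit at different positions in the two Series.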
Matplotlib
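Matplotlib is a visualization module for Python that makes it easy to produce line charts, pie charts, bar charts, and other professional figures.
With Matplotlib you can customize every aspect of a chart. It supports different GUI backends on all operating systems and can export figures to common vector and raster formats such as PDF, SVG, JPG, PNG, BMP, and GIF. Plotting turns dry numbers into charts that people can take in easily.
Matplotlib is a Python package built on NumPy that provides rich plotting tools, mainly for drawing statistical charts.
Matplotlib comes with a set of default settings, each of which can be customized: figure size, dots per inch (DPI), line width, colors and styles, subplots, axes, grid properties, and text and font properties.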
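As a rough sketch of customizing those defaults, the snippet below overrides a few rcParams entries and renders a simple line chart to an in-memory PNG. The data points and labels are made up, and the Agg backend is chosen only so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI is required
import matplotlib.pyplot as plt

# Override a few global defaults through rcParams.
plt.rcParams["figure.figsize"] = (6.0, 4.0)  # figure size in inches
plt.rcParams["figure.dpi"] = 100             # dots per inch
plt.rcParams["lines.linewidth"] = 2.0        # default line width

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16], linestyle="--", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid(True)   # grid properties can be adjusted further via ax.grid(...)
ax.legend()

# Write the figure to an in-memory buffer; a file path would work the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

Passing format="pdf" or format="svg" to savefig produces vector output instead of a raster image.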
2. Machine learning
Scikit-Learn
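Scikit-Learn is a Python module for machine learning, released under the BSD open source license.
Scikit-Learn relies on modules such as NumPy and SciPy, with Matplotlib used for plotting. Its main functionality falls into six areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
Scikit-Learn ships with some classic datasets, such as iris and digits for classification and boston house prices for regression. Each dataset is a dictionary-like structure: the data is stored in the .data member and the output labels in the .target member. Scikit-Learn is built on SciPy and provides a set of common machine learning algorithms behind a uniform interface, which makes it easy to apply popular algorithms to a dataset.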
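The dataset structure and the uniform fit/score interface can be sketched as follows. The choice of a k-nearest-neighbors classifier, the 70/30 split, and the random seed are assumptions made for illustration, not a recommendation from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled dataset: features live in .data, labels in .target.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Every estimator exposes the same fit / predict / score methods.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print(X.shape, y.shape)
print(accuracy)
```

Swapping in another classifier only changes the line that constructs clf; the fit and score calls stay the same.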
Beyond Scikit-Learn, other useful libraries include NLTK for natural language processing, Scrapy for web scraping, Pattern for web mining, and Theano for deep learning.
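XGBoost
XGBoost, as the name suggests, is an Extreme Gradient Boosting algorithm used for supervised learning.
As a rule of thumb, when you face a classification problem, try a random forest or XGBoost first and see what results you get.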
Introduction to Boosted Trees
XGBoost is short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” is proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. XGBoost is based on this original model. This is a tutorial on gradient boosted trees, and most of the content is based on these slides by the author of xgboost.
The GBM (boosted trees) has been around for really a while, and there are a lot of materials on the topic. This tutorial tries to explain boosted trees in a self-contained and principled way using the elements of supervised learning. We think this explanation is cleaner, more formal, and motivates the variant used in xgboost.
Elements of Supervised Learning
XGBoost is used for supervised learning problems, where we use the training data (with multiple features) x_i to predict a target variable y_i. Before we dive into trees, let us start by reviewing the basic elements in supervised learning.
TensorFlow
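TensorFlow is Google's second-generation machine learning system, developed as the successor to DistBelief. Its name comes from how it works: Tensor means an N-dimensional array and Flow means computation based on a dataflow graph, so TensorFlow describes tensors flowing from one end of the graph to the other. TensorFlow is a system that feeds complex data structures into artificial neural networks for analysis and processing.
TensorFlow can be used in many machine learning and deep learning areas such as speech recognition and image recognition. It improves in many respects on DistBelief, the deep learning infrastructure developed in 2011, and runs on devices ranging from a single smartphone to thousands of data center servers. TensorFlow is fully open source and anyone can use it.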
Spacy
spaCy is a Python package for NLP (natural language processing).
Official description: spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.
In practice: spaCy supports multiple languages and provides fairly complete pretrained models. It works very well for tokenization and entity recognition, and it is fast.
Comparison with mainstream NLP packages:
Gensim
A post I wrote on using Gensim for similarity applications
Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim is designed to process raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections discover semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.
Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents.
3. Demos and practice
Jupyter + IPython
IPython provides a rich architecture for interactive computing with:
- A powerful interactive shell.
- A kernel for Jupyter.
- Support for interactive data visualization and use of GUI toolkits.
- Flexible, embeddable interpreters to load into your own projects.
- Easy to use, high performance tools for parallel computing.
IPython is a growing project, with increasingly language-agnostic components. IPython 3.x was the last monolithic release of IPython, containing the notebook server, qtconsole, etc. As of IPython 4.0, the language-agnostic parts of the project: the notebook format, message protocol, qtconsole, notebook web application, etc. have moved to new projects under the name Jupyter. IPython itself is focused on interactive Python, part of which is providing a Python kernel for Jupyter.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
4. Web development
Django
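In the Python community, Django is currently the most influential web development framework. It is a heavyweight framework with built-in components for the common needs of server-side web development.
Django is widely used; for example, GAE, Google's web development platform, supports it.
Django fully supports the Jython runtime and can run on any J2EE server.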
Flask
Flask is a web microframework written in Python that lets us quickly build a website or web service in Python.
With Flask, RESTful APIs can be developed and deployed very quickly, and combined with the flask-swagger package it is easy to publish a Swagger API demo website.
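A minimal sketch of a Flask JSON endpoint follows. The /api/ping route and its payload are hypothetical, and the built-in test client is used so that the example runs without starting a server.

```python
from flask import Flask, jsonify

app = Flask(__name__)


# Hypothetical endpoint used only for illustration.
@app.route("/api/ping", methods=["GET"])
def ping():
    return jsonify(status="ok")


# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    response = client.get("/api/ping")
    payload = response.get_json()

print(response.status_code, payload)
```

To serve the endpoint over HTTP during development, app.run() can be called instead of using the test client.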
5. Operations tools
Fabric
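Fabric is a Python library that provides a rich interface for interacting over SSH. It can be used to run shell commands on local or remote machines in an automated, pipelined way, which makes it well suited to remote application deployment and system maintenance. It is also very easy to pick up: all you need is basic shell knowledge.
Points to note:
- Which package to install
-- For Python 2.x, install with pip install fabric
-- For Python 3.x, install with pip install fabric3
Example: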
from fabric.api import env, run

# Target host, login user, and private key used for the SSH connection.
env.hosts = ["10.86.17.84"]
env.user = "hduser"
env.key_filename = "./conf/id_rsa.pem"


def testls():
    # Smoke test: list a directory on the remote host.
    run("ls -l /usr/local")


def starthadoop():
    run("/usr/local/hadoop/sbin/start-all.sh")
    # pty=False keeps the detached screen session alive after the command returns.
    run("screen -d -m /usr/local/hive/bin/hive --service metastore -p 9083", pty=False)
    run("/usr/local/spark/sbin/start-all.sh")


def stophadoop():
    run("/usr/local/spark/sbin/stop-all.sh")
    run("/usr/local/hadoop/sbin/stop-all.sh")
    # Collect the PIDs of any remaining hive processes, one per line.
    pids = run("ps -ef | grep hive | grep -v 'grep' | awk '{print $2}'")
    for pid in pids.splitlines():
        if pid.strip():
            run("kill -9 %s" % pid.strip())
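6. AWS-specific packages
Boto3
Boto is the Python SDK for AWS (SDKs for other languages such as Ruby and Java also exist). It lets developers use Amazon services such as S3 and EC2 from their software, offering both a simple object-oriented API and low-level service access. Note that there are two versions of Boto. The older one, Boto2, is no longer recommended, and some newer AWS regions no longer support it (this appears to be the case for China). For new Python code, use Boto3. The reason is that Boto2 dates from around 2006, when many of today's services did not yet exist, so its design did not anticipate the services added since; Boto3 was developed from scratch to address this.
Controlling AWS resources with boto3 is straightforward. As long as ~/.aws/credentials is configured correctly, the following statements connect to S3: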
import boto3

s3 = boto3.resource("s3")
# Print the name of every bucket visible to the configured credentials.
for bucket in s3.buckets.all():
    print(bucket.name)
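# Upload an object to S3 with boto3.
# Multiple tags are passed as Tagging="key1=value1&key2=value2".
# s3conf, gzipfile, and data are assumed to be defined earlier in the script.
s3.Bucket(s3conf["bucket"]).put_object(
    # Strip the local ".\output\" prefix and use "/" separators in the S3 key.
    Key=gzipfile.replace(".\\output\\", "").replace("\\", "/"),
    Body=data,
    Tagging="companyid=0C0000183U",
)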
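Since the Tagging value is a URL-encoded query string, it can be built from a dictionary rather than concatenated by hand. The sketch below uses only the standard library; build_tagging is a hypothetical helper and the "source" tag is an invented example.

```python
from urllib.parse import urlencode


def build_tagging(tags):
    """Return tags as a URL-encoded query string such as 'key1=value1&key2=value2'."""
    return urlencode(tags)


# "companyid" comes from the example above; "source" is a made-up second tag.
tagging = build_tagging({"companyid": "0C0000183U", "source": "daily export"})
print(tagging)
```

The resulting string can be passed as the Tagging argument of put_object. Encoding matters once a key or value contains spaces, "&", or "=".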