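Developing in Python is very different from developing in many other languages. Like Ruby and Perl, Python is an interpreted language, so developers can use an interactive programming environment (a REPL) to test and execute code in real time. Because there is no compile step, Python can be used to develop and debug code prototypes quickly. Like Scala and Javascript, Python includes many practical tools that support script-style development. But like Java and C++, Python is also a highly scalable object-oriented language that supports modular programming, not just a way to run simple scripts.
Python is commonly used for quick one-off scripts, for web applications built on large, scalable frameworks like Django, and for data processing with Celery; scientific computing also accounts for a large share of Python use. Python is lightweight and efficient and comes installed by default on most systems, so it is a natural first choice for data analysis, data processing, and task analysis.
However, one major drawback of Python is that it has no complete, agreed-upon development workflow, and consequently no standard IDE or development framework. Most Python references teach you how to use it as a scripting language and completely ignore one key point: how to build a large Python project. This post introduces a workflow for building large data projects with Python.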
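Development Environment
So what do you need in order to develop data projects successfully? Just two things:
- A text editor: Notepad++, Vim, Emacs, or TextWrangler will all do. (Translator's note: or Sublime.)
- A terminal, with your environment variables set up correctly. (Translator's note: add the Python path to the PATH environment variable.)
That's it, just those two! There are of course many development environments that offer debugging, code completion, and syntax highlighting. In the end, though, they all combine a text editor with a terminal and add some useful features on top. If you insist on using an IDE, here are a few I'd suggest: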
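- IDLE - Windows users may find this one familiar, since their first Python program was usually written in it. It is simple, but it ships with Python and works reasonably well.
- Komodo Edit - This free IDE is made by ActiveState and offers many tools and useful features.
- PyCharm - Not free, but well worth the price; it works just like IntelliJ.
- Aptana Studio - Aimed mainly at web development, but with built-in Python support.
- Spyder - Focused on scientific computing.
- IPython - An interactive development environment that can save the Python code and data you run.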
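However, even if you use these tools, you will still come back to the basic development workflow described below. Sublime Text 3 has many subtle but powerful features, including syntax highlighting and easy use of pdb, and it works well alongside the command line, so many independent developers use it as their primary tool.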
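As your project grows, you will also use some of the following tools:
- Git/Github.com - Version control and code hosting.
- pip - Package management for third party Python tools and libraries.
- virtualenv and virtualenvwrapper - Virtual development environments, so that the package dependencies of different projects don't get tangled together.
There are many other useful development aids, but these three are the most important and most commonly used in current Python development, and I'll discuss them further below.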
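Third Party Libraries
As you develop, you will inevitably use third party libraries to some degree, especially tools like NumPy and Pandas when doing data processing. Installing these libraries on your system usually only requires pip, Python's package manager. Using pip will save you a lot of trouble and time, though of course you have to install it on your machine first!
requests is a simple HTTP library that makes it easy to request data from the web. To install it, just run the following command: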
$ pip install requests
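Installing, uninstalling, and upgrading are all done with the pip command. pip freeze shows the Python libraries installed on your system. To search for available libraries, go to the Python Package Index (PyPI).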
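The same information can also be read from inside Python. The following is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library); the helper name installed_packages is made up for illustration, and the output only approximates what pip freeze prints.

```python
from importlib import metadata


def installed_packages():
    """Return a {name: version} mapping of the distributions this interpreter can see."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip distributions whose metadata is missing or has no name.
        name = meta.get("Name") if meta is not None else None
        if name:
            packages[name] = meta.get("Version")
    return packages


if __name__ == "__main__":
    # Print one "name==version" line per package, similar in spirit to pip freeze.
    for name, version in sorted(installed_packages().items()):
        print("{}=={}".format(name, version))
```

Run inside a virtual environment, this lists only what that environment can import, which is a quick way to confirm that the isolation is working.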
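Virtual Environments
As you develop more and more, you'll find that some specific versions of tools are hard to get running: particular projects need particular versions of libraries or tools, and sometimes those conflict with the libraries used by other projects. If you develop for both Python 2 and Python 3, even Python itself can be a problem, and there is a (small) chance that you break your system while developing.
The solution is to give each project its own dedicated virtual environment and develop inside it. A virtual environment is a directory that contains a specific version of Python, pip, and third party packages. It is activated and deactivated from the command line, and any user can create their own. It can also be set up to match a specific production environment (usually Linux).
Virtualenvwrapper is another library that lets you manage multiple virtual environments and associate each of them with a specific project. This tool is just as essential. Install both with the following command: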
$ pip install virtualenv virtualenvwrapper
Then edit the .profile file in your home directory and add the following lines at the end:
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Projects
source /usr/local/bin/virtualenvwrapper.sh
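All of your virtual environments will be stored in a hidden directory called .virtualenvs, and your Projects directory is where your code will live; I discuss this below. To make things more convenient, I have also created many aliases for the virtualenv scripts, which you can find in Ben's VirtualEnv Cheat Sheet.
NOTE: Windows users may need to follow different, OS-specific steps.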
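When several environments are in play, it helps to confirm which one is active. The following is a small sketch, not part of virtualenvwrapper itself; the function name in_virtualenv is an assumption made for illustration, and the check relies only on attributes of the standard sys module.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; the venv module and newer
    # virtualenv releases leave the original location in sys.base_prefix instead.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("virtualenv active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Inside an activated environment the prefix should point somewhere under your WORKON_HOME directory rather than at the system Python.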
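Code Construction
There are two ways to build and execute code:
- Write the code in a text file and then execute it with python.
- Write the code in a text file and then import it into the interactive environment (the REPL).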
Generally speaking, developers do both. Python programs are intended to be executed on the command line via the python binary, and the thing that is executed is usually an entry point to a much larger library of code that is imported. The difference between importing and execution is subtle, but as you do more Python it becomes more important.
With either of these workflows, you create your code in as modular a fashion as possible and, during the creation process, you execute it in one of the methods described above to check it's working. Most Python developers are back and forth between their terminal and the editor, and can do fine grained testing of every single line of code as they're writing it. This is the rapid prototyping aspect of Python.
So let's start with a simple example.
Open a terminal window (see your specific operating system for instructions on how to do this).
NOTE: Commands are in bash (Linux/Mac) or Windows Powershell
Create a workspace for yourself. A workspace, in this sense, is just an empty directory where you can get ready to start doing development work. You should probably also keep your various projects (here, a synonym for workspace) in their own directory; for now we'll just call it "Projects" and assume it is in your home directory. Our first project will be called "myproject", but you can name it whatever you'd like.
$ cd ~/Projects
$ mkdir myproject
$ cd myproject
Let's create our first Python script. You can either open your favorite editor and save the file into your workspace (the ~/Projects/myproject directory), or you can touch it and then open that file with your editor.
$ touch foo.py
PRO TIP: If you're using Sublime Text 3 and have the subl command line tool installed (see the Sublime Text installation instructions), you can use the following command to open up the current directory in the editor:
$ subl . &
I use this so much that I've aliased the command to e.
So here's where you should be: you should have a text editor open and editing the file at ~/Projects/myproject/foo.py, and you should have a terminal window open whose current working directory is ~/Projects/myproject. You're now ready to develop. Add the following code to foo.py:
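#!/usr/bin/env python

import csv

def dataset(path):
    # Plain 'r' mode is used here because the old 'rU' mode no longer
    # exists in recent Python 3 releases.
    with open(path, 'r') as data:
        reader = csv.reader(data)
        for row in reader:
            # Convert the third column to an integer
            row[2] = int(row[2])
            yield row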
This code is very simple. It just implements a function that accepts a path and returns an iterator so that you can access every row of a CSV file, while also converting the third item in every row to an integer.
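For readers on Python 3, the following self-contained sketch shows the same generator at work. It assumes Python 3 (the newline="" argument is what the csv module documentation recommends there), and it writes a throwaway file with two sample rows so that it can run on its own; the temporary path is an illustration only.

```python
import csv
import os
import tempfile


def dataset(path):
    """Yield each CSV row with its third column converted to an int."""
    with open(path, newline="") as data:
        for row in csv.reader(data):
            row[2] = int(row[2])
            yield row


# Write a throwaway fixture so the sketch runs without any other files.
fixture = os.path.join(tempfile.mkdtemp(), "calories.csv")
with open(fixture, "w", newline="") as handle:
    handle.write("butter,tbsp,102\nwhole milk,cup,148\n")

# The generator reads lazily; list() forces it to read every row.
rows = list(dataset(fixture))
print(rows)
```

Because dataset is a generator, nothing is read from disk until you iterate over it, which keeps memory use flat even for large files.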
PRO TIP: The #! (pronounced "shebang") line must appear at the very beginning of an executable Python script with nothing before it. It will tell your computer that this is a Python file and execute the script correctly if run from the command line as a standalone app. This line doesn't need to appear in library modules, that is, Python code that you plan to import rather than execute.
Create some data so that we can use our function. Let's keep all of our data in a fixtures directory in our project.
$ mkdir fixtures
$ touch fixtures/calories.csv
Using your editor, add this data to the calories.csv file:
butter,tbsp,102
cheddar cheese,slice,113
whole milk,cup,148
hamburger,item,254
Ok, now it's time to use our code. First, let's try to execute the code in the interpreter. Open up the REPL as follows:
$ python
>>>
You should now be presented with the Python prompt (>>>). Anything you type in now should be in Python, not bash. Always note the prompts in the instructions. A prompt with $ means type in command line instructions (bash), a prompt that says >>> means type in Python on the REPL, and if there is no prompt, you're probably editing a file. Import your code:
>>> from foo import dataset
>>> for row in dataset('fixtures/calories.csv'):
...     print(row[0])
...
butter
cheddar cheese
whole milk
hamburger
>>>
A lot happened here, so let's inspect it. First, when you imported the dataset function from foo, Python looked in your current working directory and found the foo.py file, and that's where it imported it from. Where you are on the command line and what your Python path is matters!
When you import the dataset function the way we did, the module is loaded and executed all at once and provided to the interpreter's namespace. You can now use it by writing a for loop to go through every row and print the first item. Note the ... prompt. This means that Python is expecting an indented block. To exit the block, hit enter twice. The print results appear right in the screen, and then you're returned to the prompt.
But what if you make a change in the code, for example, capitalizing the first letter of the words in first item of each row? The changes you write in your file won't show up in the REPL. This is because Python has already loaded the code once. To get the changes, you either have to exit the REPL and restart or you have to import in a different way:
>>> import foo
>>> for row in foo.dataset('fixtures/calories.csv'):
...
Now you can reload the foo module and get your code changes:
>>> reload(foo)  # on Python 3, first run: from importlib import reload
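To see reloading in action outside the REPL, here is a minimal sketch assuming Python 3, where the function lives at importlib.reload. The module name reload_demo and its label function are hypothetical; the script writes the module to a temporary directory, imports it, edits it on disk, and reloads it.

```python
import importlib
import os
import sys
import tempfile

# Keep the demo simple: never write .pyc files, so each load reads the source.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "reload_demo.py")


def write_module(body):
    """Overwrite the hypothetical reload_demo module with new source code."""
    with open(module_path, "w") as handle:
        handle.write(body)


write_module("def label(word):\n    return word\n")
sys.path.insert(0, workdir)
importlib.invalidate_caches()

import reload_demo

first = reload_demo.label("butter")

# Change the code on disk, as you would in your editor, then reload it.
write_module("def label(word):\n    return word.capitalize()\n")
importlib.reload(reload_demo)
second = reload_demo.label("butter")

print(first, second)
```

Note that reloading only rebinds names looked up through the module object; anything imported earlier with from reload_demo import label would still point at the old function, which is why the text switches to a plain import.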
This can get pretty unwieldy as code gets larger and more changes happen, so let's shift our development strategy over to executing Python files. Inside foo.py, add the following to the end of the file:
if __name__ == '__main__':
    for row in dataset('fixtures/calories.csv'):
        print(row[0])
To execute this code, you simply type the following on the command line:
$ python foo.py
butter
cheddar cheese
whole milk
hamburger
The if __name__ == '__main__': statement means that the code will only get executed if the code is run directly, not imported. In fact, if you open up the REPL and type in import foo, nothing will be printed to your screen. This is incredibly useful. It means that you can put test code inside your script as you're developing it without worrying that it will interfere with your project. Not only that, it documents to other developers how the code in that file should be used and provides a simple test to check to make sure that you're not creating errors.
In larger projects, you'll see that most developers put test and debugging code under so called "ifmain" statements at the bottom of their files. You should do this too!
With this example, hopefully you have learned the workflow for developing Python programs both by executing scripts and using "ifmain" as well as importing and reloading scripts in the REPL. Most developers use both methods interchangeably, using whatever is needed at the time.
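One common way to lay out an "ifmain" block is sketched below. The helper names capitalize_items and main are assumptions for illustration, not something the workflow requires; keeping the logic in functions means the same file works both as a script and as an importable module.

```python
def capitalize_items(rows):
    """Return the first column of each row with its first letter capitalized."""
    return [row[0].capitalize() for row in rows]


def main():
    # Inline sample rows keep the sketch free of file dependencies.
    rows = [["butter", "tbsp", 102], ["hamburger", "item", 254]]
    for item in capitalize_items(rows):
        print(item)


if __name__ == "__main__":
    # Runs only when the file is executed directly, never on import.
    main()
```

Importing this file from the REPL prints nothing, but capitalize_items is still available for interactive testing.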
Structuring Larger Projects
Ok, so how do you write an actual Python program and move from experimenting with short snippets of code to larger programs? The first thing you have to do is organize your code into a project. Unfortunately there is really nothing to do this for you automatically, but most developers follow a well known pattern that was introduced by Zed Shaw in his book Learn Python the Hard Way.
In order to create a new project, you'll implement the "Python project skeleton," a set of directories and files that belong in every single project you create. The project skeleton is very familiar to Python developers, and you'll quickly start to recognize it as you investigate the code of other Python developers (which you should be doing). The basic skeleton is implemented inside of a project directory, which is stored in your workspace as described above. The directory structure is as follows (for an example project called myproject):
$ myproject
.
├── README.md
├── LICENSE.txt
├── requirements.txt
├── setup.py
├── bin
|   └── myapp.py
├── docs
|   ├── _build
|   ├── conf.py
|   ├── index.rst
|   └── Makefile
├── fixtures
├── foo
|   └── __init__.py
└── tests
    └── __init__.py
This is a lot, but don't be intimidated. This structure implements many tools including packaging for distribution, documentation with Sphinx, testing, and more.
Let's go through the pieces one by one. Project documentation is the first part, implemented as README.md and LICENSE.txt files. The README file is a markdown document in which you can add developer-specific documentation for your project. The LICENSE can be any open source license, or a Copyright statement in the case of proprietary code. Both of these files are typically generated for you if you create your project in Github. If you do create your project in Github, you should also use the Python .gitignore that Github provides, which helps keep your repositories clean.
The setup.py script is a Python setuptools or distutils installation script and will allow you to configure your project for deployment. It will use the requirements.txt to specify the third party dependencies required to implement your project. Other developers will also use these files to create their development environments.
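One way a setup script can pick up those dependencies is to parse the requirements file itself. The sketch below is an assumption about how you might wire this up, not a prescribed layout: read_requirements is a hypothetical helper, the pinned versions are illustrative only, and the parser ignores pip-specific lines such as -r includes.

```python
import os
import tempfile


def read_requirements(path):
    """Return the requirement specifiers in a file, skipping blanks and comments."""
    requirements = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                requirements.append(line)
    return requirements


# Demonstrate with a throwaway file; the pinned versions are examples only.
demo = os.path.join(tempfile.mkdtemp(), "requirements.txt")
with open(demo, "w") as handle:
    handle.write("# third party dependencies\nrequests==2.31.0\n\nnose==1.3.7\n")

parsed = read_requirements(demo)
print(parsed)
```

In a real setup.py, the resulting list could be passed to setuptools.setup through its install_requires argument.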
The docs directory contains the Sphinx documentation generator. Python documentation is written in reStructuredText, a markup language similar to Markdown and others. This documentation should be more extensive and should be for both users and developers. The bin directory will contain any executable scripts you intend to build. Data scientists also typically have a fixtures directory in which to store data files.
The foo and tests directories are actually Python packages (importable modules) since they contain the __init__.py file. You'll put your code in foo and your tests in tests. Once you start developing inside your foo directory, note that when you open up the REPL, you have to import everything from the 'foo' namespace. You can put import statements in your __init__.py files to make things easier to import as well. You can still also execute your scripts in the foo directory using the "ifmain" method.
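The effect of putting import statements in __init__.py can be sketched as follows. The package name foo_demo and its data module are hypothetical and are created in a temporary directory purely so the example is self-contained; in a real project these would simply be files in your source tree.

```python
import importlib
import os
import sys
import tempfile

# Build a tiny throwaway package: foo_demo/data.py plus foo_demo/__init__.py.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "foo_demo")
os.mkdir(package_dir)

with open(os.path.join(package_dir, "data.py"), "w") as handle:
    handle.write("def dataset(path):\n    return 'rows from ' + path\n")

# Re-export dataset at the package level so callers can skip the submodule name.
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("from .data import dataset\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

# Without the re-export this would have to be: from foo_demo.data import dataset
from foo_demo import dataset

result = dataset("fixtures/calories.csv")
print(result)
```

The trade-off is that every name you re-export becomes part of the package's public surface, so it is worth keeping the list short.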
Setting Up Your First Project
You don't have to manually create the structure above; many tools will help you build this environment. For example, the Cookiecutter project will help you manage project templates and quickly build them. The sphinx-quickstart command will generate your documentation directory. Github will add the README.md and LICENSE.txt stubs. Finally, pip freeze will generate the requirements.txt file.
Starting a Python project is a ritual, however, so I will take you through my process for starting one. Light a candle, roll up your sleeves, and get a coffee. It's time.
Inside of your Projects directory, create a directory for your workspace (project). Let's pretend that we're building a project that will generate a social network from emails; we'll call it "emailgraph."
$ mkdir ~/Projects/emailgraph
$ cd ~/Projects/emailgraph
Initialize your repository with Git.
$ git init
Initialize your virtualenv with virtualenvwrapper.
$ mkvirtualenv -a $(pwd) emailgraph
This will create the virtual environment in ~/.virtualenvs/emailgraph and automatically activate it for you. At any time and at any place on the command line, you can issue the workon emailgraph command and you'll be taken to your project directory (the -a flag specifies that this is the project directory for this virtualenv).
Create the various directories that you'll require:
(emailgraph)$ mkdir bin tests emailgraph docs fixtures
And then create the various files that are needed:
(emailgraph)$ touch tests/__init__.py
(emailgraph)$ touch emailgraph/__init__.py
(emailgraph)$ touch setup.py README.md LICENSE.txt .gitignore
(emailgraph)$ touch bin/emailgraph-admin.py
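If you find yourself repeating these mkdir and touch steps for every project, they can be scripted. The following is a rough sketch using only the standard library's pathlib module; create_skeleton is a hypothetical helper, and it runs against a temporary directory here so that it cannot touch a real project.

```python
import tempfile
from pathlib import Path


def create_skeleton(root, package):
    """Create the skeleton's directories and empty files under root."""
    root = Path(root)
    for directory in ("bin", "docs", "fixtures", "tests", package):
        (root / directory).mkdir(parents=True, exist_ok=True)
    files = (
        "README.md", "LICENSE.txt", "requirements.txt", "setup.py", ".gitignore",
        "tests/__init__.py", package + "/__init__.py",
        "bin/" + package + "-admin.py",
    )
    for name in files:
        (root / name).touch()
    # Report the files that now exist, relative to the project root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


created = create_skeleton(tempfile.mkdtemp(), "emailgraph")
print(created)
```

Tools like Cookiecutter do the same job with templates and are the better choice once the skeleton needs real file contents.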
Generate the documentation using sphinx-quickstart:
(emailgraph)$ sphinx-quickstart
You can safely use the defaults, but make sure that you do accept the Makefile at the end to quickly and easily generate the documentation. This should create an index.rst and conf.py file in your docs directory.
Install nose and coverage to begin your test harness:
(emailgraph)$ pip install nose coverage
Open up the tests/__init__.py file with your favorite editor, and add the following initialization tests:
import unittest

class InitializationTests(unittest.TestCase):

    def test_initialization(self):
        """
        Check the test suite runs by affirming 2+2=4
        """
        self.assertEqual(2+2, 4)

    def test_import(self):
        """
        Ensure the test suite can import our module
        """
        try:
            import emailgraph
        except ImportError:
            self.fail("Was not able to import the emailgraph")
From your project directory, you can now run the test suite, with coverage as follows:
(emailgraph)$ nosetests -v --with-coverage --cover-package=emailgraph \
    --cover-inclusive --cover-erase tests
You should see two tests passing along with a 100% test coverage report.
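If nose is not available in your environment, the same kind of test case can be run with the standard library alone. This sketch swaps in unittest's own loader and runner in place of nosetests and does not measure coverage; it defines a single test inline so that it is self-contained.

```python
import unittest


class InitializationTests(unittest.TestCase):

    def test_initialization(self):
        """Check the test suite runs by affirming 2+2=4."""
        self.assertEqual(2 + 2, 4)


# Load the test case explicitly and run it with the text runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InitializationTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("tests run:", result.testsRun, "successful:", result.wasSuccessful())
```

From the command line, python -m unittest discover performs a similar search over a tests directory.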
Open up the setup.py file and add the following lines:
#!/usr/bin/env python
raise NotImplementedError("Setup not implemented yet.")
Setting up your app for deployment is the topic of another post, but this will alert other developers to the fact that you haven't gotten around to it yet.
Create the requirements.txt file using pip freeze:
(emailgraph)$ pip freeze > requirements.txt
Finally, commit all the work you've done on emailgraph to the repository.
(emailgraph)$ git add --all
(emailgraph)$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

    new file:   LICENSE.txt
    new file:   README.md
    new file:   bin/emailgraph-admin.py
    new file:   docs/Makefile
    new file:   docs/conf.py
    new file:   docs/index.rst
    new file:   emailgraph/__init__.py
    new file:   requirements.txt
    new file:   setup.py
    new file:   tests/__init__.py

(emailgraph)$ git commit -m "Initial repository setup"
With that you should have your project all set up and ready to go. Get some more coffee, it's time to start work!
Conclusion
With this post, hopefully you've discovered some best practices and workflows for Python development. Structuring both your code and projects this way will help keep you organized and will also help others quickly understand what you've built, which is critical when working on projects involving more than one person. More importantly, this project structure is the preparation for deployment and the base for larger applications and professional, production grade software. Whether you're scripting or writing apps, I hope that these workflows will be useful.
If you'd like to explore further how to include professional grade tools into your Python development, check out some of the following tools:
Travis-CI is a continuous integration service that will automatically run your test harness when you commit to Github. It will make sure that all of your tests are passing before you push to production!
Waffle.io will turn your Github issues into a full Agile board allowing you to track milestones and sprints, and better coordinate your team.
Pylint will automatically check for good coding standards, error detection, and even draw UML diagrams for your code!
If you're having trouble with anything we've covered or you find any errors, please leave us a comment! Also, all developers are as different as they are the same, so if you have a workflow that you think others would benefit from, please let us know in the comments!
Further Reading
- Learn Python the Hard Way
- Learning Python, 5th Edition
- Programming Python
- Practical Data Science Cookbook
- Python for you and me
- Easy-Python
- Awesome Python
- Python Free Books
- Python Ecosystem An Introduction
- Full Stack Python
- Talk Python FM