MLfromCash: May 2016

Tuesday, May 31, 2016

Python_Note12

Web Scraping and Parsing

Fetching: urllib

Parsing: HTMLParser

urllib

https://docs.python.org/3/library/urllib.html#module-urllib

`urllib` is a package that collects several modules for working with URLs:

`urllib.request` for opening and reading URLs

`urllib.error` containing the exceptions raised by `urllib.request`

`urllib.parse` for parsing URLs

`urllib.robotparser` for parsing `robots.txt` files

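urllib is part of the standard library and covers the routine work of fetching a web page: call urlopen() on a URL, call read() on the returned object to get the full page content, then close it. Work that could otherwise be quite involved is already wrapped up for you.

urlopen() is used much like Python's built-in open(): it returns a file-like object.
urllib.request.urlopen('URL')
The argument must use a supported scheme such as http or ftp.

urllib.request.urlopen('http://www.yahoo.com.tw')
Note that the scheme prefix (e.g. http://) is required.

The target can also be a file on the local machine:

urllib.request.urlopen('file:///c:/user/filename.ext')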

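Reading a page

read() returns the entire content as bytes.
Call decode() on the bytes object with the page's encoding to get a str.

response = urllib.request.urlopen('http://invoice.etax.nat.gov.tw/')
response.read().decode('utf_8')

read() also accepts a size argument; for example, read(10) returns at most 10 bytes (not a 10-character string).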

Example: fetching the uniform-invoice (統一發票) page
# -*- coding: UTF-8 -*-
# Read a web page with urllib
import urllib.request

response = urllib.request.urlopen('http://invoice.etax.nat.gov.tw/')
html = response.read().decode('utf_8')
response.close()
print(html)
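The open/read/decode/close sequence can also be wrapped in a small helper. The following is a minimal sketch, not the note's own code: the name fetch_text is made up for illustration, and a temporary local file stands in for a real site so the example runs without network access.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The context manager closes the response even if read() raises.
        with urllib.request.urlopen(url) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as exc:
        print("request failed:", exc.reason)
        return None


# Demonstrate with a local file so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    page = pathlib.Path(tmp) / "page.html"
    page.write_text("<html><body>hello</body></html>", encoding="utf-8")
    local_text = fetch_text(page.as_uri())

print(local_text)
```

Catching urllib.error.URLError also covers HTTP errors, since HTTPError is a subclass of it.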

HTMLParser

https://docs.python.org/3/library/html.parser.html#module-html.parser

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

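HTMLParser is a lenient HTML parser: rather than validating strictly, it copes with things like unbalanced tags. Given how many malformed pages exist on the web, a fault-tolerant parser is the better choice.
It works by having you override a family of handle_xxx methods, where xxx names the kind of markup. handle_data, for instance, handles text that is not inside <> (anything that is not a tag); the parser calls it whenever it reaches such text, so overriding it lets you process that text. To process the tags themselves, override handle_starttag and its siblings.
from html.parser import HTMLParser
Subclass HTMLParser
to define your own parser class for the page source
Override the handle_xxx methods you need and implement their bodies
Create a parser object from your class
Call feed() to parse the text passed in

Methods you can override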

HTMLParser.handle_starttag(tag, attrs)
HTMLParser.handle_endtag(tag)
HTMLParser.handle_startendtag(tag, attrs)
HTMLParser.handle_data(data)
HTMLParser.handle_entityref(name)
HTMLParser.handle_charref(name)
HTMLParser.handle_comment(data)
HTMLParser.handle_decl(decl)
HTMLParser.handle_pi(data)
HTMLParser.unknown_decl(data)

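Before scraping a page, take a look at its HTML.

HTML tag types
The common tag types are described below:

starttag

Without attributes (attrs), e.g. <head>

With attributes: <span class="t18Red">

The attributes are passed as a list of (name, value) pairs: [("class", "t18Red")]

endtag

</head>

Anything of the form </xxx> is an endtag

startendtag

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

data

Text enclosed by tags that is not itself a tag is called data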
Example: the basic pattern (applied to the uniform-invoice page further below)

# -*- coding: UTF-8 -*-
# HTMLParser example
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

myparser = MyHTMLParser()
myparser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')

Out:

Encountered a start tag: html

Encountered a start tag: head

Encountered a start tag: title

Encountered some data : Test

Encountered an end tag : title

Encountered an end tag : head

Encountered a start tag: body

Encountered a start tag: h1

Encountered some data : Parse me!

Encountered an end tag : h1

Encountered an end tag : body

Encountered an end tag : html

HTMLParser also provides the following methods

HTMLParser.feed(data)
HTMLParser.close()
HTMLParser.reset()
HTMLParser.getpos() (for example: print(data, 'pos:', self.getpos()))
HTMLParser.get_starttag_text()

Example: parsing the uniform-invoice numbers

Look through the page source for markers around the data you want
The invoice numbers sit between <span class="t18Red"> and </span>

# -*- coding: UTF-8 -*-
# Scrape the uniform-invoice numbers
import urllib.request
from html.parser import HTMLParser

data = urllib.request.urlopen('http://invoice.etax.nat.gov.tw')
content = data.read().decode('utf_8')
data.close()

class myparser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.isNumber = 0
        self.numbers = []

    def handle_data(self, data):
        if self.isNumber == 1:
            self.numbers.append(data)  # or: self.numbers.extend(data.split('、'))
            self.isNumber = 0
            print(data, 'pos:', self.getpos())  # line and offset of data within the page

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and attrs == [('class', 't18Red')]:  # the marker
            self.isNumber = 1

    def handle_endtag(self, tag):
        pass

Parser = myparser()
Parser.feed(content)
print(Parser.numbers)

Out:

18498950 pos: (2, 1388)

08513139 pos: (2, 1505)

21881534、53050416、85174778 pos: (2, 1619)

086 pos: (2, 2176)

51730762 pos: (2, 2682)

67442563 pos: (2, 2799)

11036956、55794786、62610317 pos: (2, 2913)

079 pos: (2, 3470)

['18498950', '08513139', '21881534、53050416、85174778', '086', '51730762', '67442563', '11036956、55794786、62610317', '079']

pos: (line, offset)

If the append line is changed to self.numbers.extend(data.split('、'))

the output becomes

['18498950', '08513139', '21881534', '53050416', '85174778', '086', '51730762', '67442563', '11036956', '55794786', '62610317', '079']

Taking the first six items then gives the current period's winning invoice numbers (left as an exercise).
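The same flag-based technique can be tried without touching the live site. The sketch below is illustrative only: SAMPLE_HTML is a made-up snippet that reuses numbers from the sample output above, and InvoiceParser is a hypothetical variant that looks up the class attribute through dict(attrs), which keeps working if the span carries extra attributes.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for the live page; the numbers are taken
# from the sample output above.
SAMPLE_HTML = (
    '<table><tr><td><span class="t18Red">18498950</span></td></tr>'
    '<tr><td><span class="t18Red">21881534、53050416、85174778</span></td></tr>'
    '<tr><td><span class="other">ignored</span></td></tr></table>'
)


class InvoiceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_number = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        # dict(attrs) tolerates extra attributes, unlike comparing whole lists.
        if tag == "span" and dict(attrs).get("class") == "t18Red":
            self.in_number = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_number = False

    def handle_data(self, data):
        if self.in_number:
            # One span may hold several numbers separated by '、'.
            self.numbers.extend(data.split("、"))


parser = InvoiceParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.numbers)
```

Resetting the flag in handle_endtag, rather than in handle_data, keeps the parser correct even if the text inside one span arrives in more than one piece.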

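URL encoding and decoding

import urllib.parse
urllib.parse.quote(str)

Percent-encodes the string str so that it can be used in a URL (non-ASCII characters are encoded as UTF-8 by default)

Example: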

In[7]: urllib.parse.quote('美國隊長3')

Out[7]: '%E7%BE%8E%E5%9C%8B%E9%9A%8A%E9%95%B73'

urllib.parse.unquote(str)

Decodes a percent-encoded string

Example:

In[8]: urllib.parse.unquote('%E7%BE%8E%E5%9C%8B%E9%9A%8A%E9%95%B73')

Out[8]: '美國隊長3'

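The techniques above only cover simple page fetches. Real pages use many more features, so the next example reads a drop-down menu on a dynamic page and submits a query to it.

Taiwan High Speed Rail (THSR) site
http://www.thsrc.com.tw/tw/TimeTable/SearchResult

First, how do you tell whether the page uses POST or GET?

In the browser: right-click the page >> Inspect >> Network >> Headers >> Form Data

Remember to run a search on the page first; otherwise no request is sent and there is nothing to inspect.

POST

POST puts the submitted data in the message body.
With POST there is no need to worry about a size limit on the data, and because the form data travels in the message body rather than in the URL, users cannot alter it by editing the address bar. In most cases POST is therefore the more suitable way to send form data to a web server.

GET

GET passes variable names and values to the web server in the URL itself. URLs have a length limit, so GET is a poor fit for sending large amounts of data.
A simple GET request appends a question mark ? to the end of the URL, followed by the variables in the form "name=value".
Multiple variables are separated by &.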
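To make the difference concrete, here is a small sketch that builds both kinds of request with urllib without sending either. The endpoint http://example.com/search and the field names q and page are placeholders invented for this illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only.
BASE_URL = "http://example.com/search"
fields = {"q": "美國隊長3", "page": "2"}

# urlencode() produces name=value pairs joined by '&', percent-encoded.
query = urllib.parse.urlencode(fields)

# GET: the encoded fields go after '?' in the URL.
get_request = urllib.request.Request(BASE_URL + "?" + query)

# POST: the same encoded fields go in the message body, as bytes.
post_request = urllib.request.Request(BASE_URL, data=query.encode("utf-8"))

print(get_request.full_url)
print(get_request.get_method(), post_request.get_method())
```

urllib chooses the method from the presence of data: a Request with a body is sent as POST, one without as GET.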

Example:

# -*- coding: UTF-8 -*-
# THSR timetable query
import urllib.request
import urllib.parse

# Fill in the current Form Data values shown in the browser. Stale values
# (for example a past date, which the site no longer offers in its menu)
# make the site answer that the page does not exist.
data = urllib.parse.urlencode({
    'StartStation': '977abb69-413a-4ccf-a109-0272c24fd490',
    'EndStation': 'f2519629-5973-4d08-913b-479cce78a356',
    'SearchDate': '2016/05/31',
    'SearchTime': '14:00',
    'SearchWay': 'DepartureInMandarin',
    'RestTime': '',
    'EarlyOrLater': ''})
data = data.encode('utf-8')

request = urllib.request.Request(
    'http://www.thsrc.com.tw/tw/TimeTable/SearchResult',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'})
response = urllib.request.urlopen(request)  # the data is already attached to the request
html = response.read().decode('utf_8')

with open('data.html', 'w', encoding='utf_8') as f:
    print(html, file=f)

This example writes the response to data.html. Opening it shows that the drop-down menu contents were captured; since nothing has been parsed yet, though, the file contains the entire page, drop-down menus included.

Header

The server needs to be told which browser is making the request, so the User-Agent field in the headers must be set as well:

headers = {'User-Agent' :'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

Python_Note11

Module

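A Python file (.py) is a module. A module can define variables, functions, classes, or any other code it needs.

With the keyword import, a module can use what other modules define.

Using modules

Creating modules:

Create a file A.py
Create a second file B.py

Loading a module

In the second file, write import A
You can also write from ModuleName import FunctionName
Python compiles the imported module to bytecode and caches it (in Python 3, as a .pyc file in a __pycache__ directory next to A.py)
With the second form, '*' means import every function
help() prints the documentation strings of the functions a module defines
from module import * imports all of the module's functions, so they can be called by bare name instead of as module.function.

__name__

When the second file runs import A, Python first executes all top-level (unindented) code in A.py and only then continues with the code in B.py. Should the module that runs the program be kept separate from the module that holds the definitions? Not necessarily: it is often handy to test definitions right in the file that contains them. But when that file is imported elsewhere, we need a way to tell the two situations apart, and that is what the built-in variable __name__ is for. When it equals "__main__", the current module is the one being run as the program.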

Example 1:

# -*- coding: UTF-8 -*-
# EX08_02_1.py
# Module example 1
import random

if __name__ == '__main__':  # runs only when this file is the main program
    print('模組範例檔_1')

class tools:
    def generate_random_nums(self, count):
        nums = []
        for i in range(count):
            nums.append(random.randint(1, 100))
        return nums

Out:

模組範例檔_1

Example 2:

# -*- coding: UTF-8 -*-
# EX08_02_2.py
# Module example 2
import EX08_02_1

if __name__ == '__main__':  # runs only when this file is the main program
    print('模組範例檔_2')
    tools = EX08_02_1.tools()
    print(tools.generate_random_nums(10))

Out:

模組範例檔_2
[60, 58, 37, 16, 10, 96, 11, 47, 8, 93]

The print('模組範例檔_1') in EX08_02_1 is not executed, because that module is not the main program here.

Example 3: using from

from EX08_02_1 import tools

if __name__ == '__main__':
    print('模組範例檔_2')
    tool = tools()
    print(tool.generate_random_nums(10))

Example 4: using import ... as ...

import EX08_02_1 as ex

if __name__ == '__main__':
    print('模組範例檔_2')
    tool = ex.tools()
    print(tool.generate_random_nums(10))

Examples 3 and 4 produce the same kind of output as Example 2
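The effect of __name__ can also be observed from a single script. The sketch below is an illustration under stated assumptions: MODULE_SOURCE and run_module_as are invented for this example, and the standard-library function runpy.run_path is used to execute the same source once as if it were the main program and once as if it had been imported.

```python
import os
import runpy
import tempfile

# Hypothetical module source: it records whether it ran as the main program.
MODULE_SOURCE = (
    "ran_as_main = False\n"
    "if __name__ == '__main__':\n"
    "    ran_as_main = True\n"
)


def run_module_as(run_name):
    """Execute MODULE_SOURCE with __name__ set to run_name; return its flag."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_module.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(MODULE_SOURCE)
        module_globals = runpy.run_path(path, run_name=run_name)
    return module_globals["ran_as_main"]


as_script = run_module_as("__main__")      # like: python demo_module.py
as_import = run_module_as("demo_module")   # like: import demo_module

print(as_script, as_import)
```

Only the run whose __name__ is "__main__" enters the guarded block, which is why the first flag is True and the second is False.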

The Python Standard Library

https://docs.python.org/3.5/library/index.html

OS Module -- Miscellaneous operating system interfaces

https://docs.python.org/3/library/os.html#module-os

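Provides functions for inspecting the system environment and performing operating-system operations

os.rename(src, dst, *, src_dir_fd=None, dst_dir_fd=None)

Renames a file or directory
src is the existing name
dst is the new name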

os.rename('d:\\temp\\old.txt', 'd:\\temp\\new.txt')

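os.remove(path, *, dir_fd=None)

Removes a file
path is the location of the file
Does not remove directories

os.removedirs(name)

Removes empty directories (the leaf directory first, then each parent that is left empty)

os.listdir(path='.')

Lists the entries in a directory

os.chdir(path)

Changes the current working directory

os.getcwd()

Returns the current working directory

os.mkdir(path[, mode])

Creates a directory
path is the location of the directory to create (or, for rmdir, to delete)
mode applies on Unix platforms

os.rmdir(path)

Removes an empty directory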

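os.path.getsize(path)

Returns the size of a file

Example:

In[24]: os.path.getsize('C:\\Qt\\Qt5.5.1\\InstallationLog.txt')

Out[24]: 249342 # in bytes

os.path.getctime(path)

Returns the file's creation time (on Windows; on Unix it is the time of the last metadata change)

os.path.getmtime(path)

Returns the file's last-modification time

os.path.getatime(path)

Returns the file's last-access time

Example:

In[25]: os.path.getctime('C:\\Qt\\Qt5.5.1\\InstallationLog.txt')

Out[25]: 1450242104.5591452 # seconds elapsed since 1970-01-01 00:00:00

os.path.isfile(path)

Tells whether path is an existing file

os.path.isdir(path)

Tells whether path is an existing directory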
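Several of the functions above can be exercised together in a throwaway directory. This is only a sketch: the directory and file names are arbitrary, and tempfile is used so nothing outside the temporary folder is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "notes")
    os.mkdir(sub)

    old_path = os.path.join(sub, "old.txt")
    new_path = os.path.join(sub, "new.txt")
    with open(old_path, "w", encoding="utf-8") as f:
        f.write("hello")

    os.rename(old_path, new_path)
    listing = os.listdir(sub)
    size = os.path.getsize(new_path)     # size in bytes
    is_file = os.path.isfile(new_path)
    is_dir = os.path.isdir(sub)

    os.remove(new_path)   # remove() deletes files only
    os.rmdir(sub)         # rmdir() requires the directory to be empty
    still_there = os.path.exists(sub)

print(listing, size, is_file, is_dir, still_there)
```

The order of the last two calls matters: os.rmdir() would fail while new.txt is still inside the directory.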

random Module -- Generate pseudo-random numbers

https://docs.python.org/3/library/random.html#module-random

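random.random()

Returns a random float x with 0.0 <= x < 1.0

Example:

In[33]: random.random()

Out[33]: 0.6161858237684328

random.uniform(num1, num2)

Returns a random float x between num1 and num2 (num1 <= x <= num2; whether the upper end can actually be returned depends on floating-point rounding)

Example:

In[34]: random.uniform(10,20)

Out[34]: 14.398467660358468

random.randint(1,10)

Returns a random integer from 1 to 10, both ends included

Example:

In[35]: random.randint(5,10)

Out[35]: 8

random.randrange(0,101)

Returns a random integer from 0 to 100 (the stop value 101 is excluded)

Example:

In[36]: random.randrange(0,101)

Out[36]: 66

random.choice()

Returns one randomly chosen item from the sequence passed in

Example: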

In[38]: members = ['Cash','Mary','Tom','Victor','Han']

In[39]: random.choice(members)

Out[39]: 'Tom'

In[40]: random.choice(members)

Out[40]: 'Tom'

In[41]: random.choice(members)

Out[41]: 'Tom'

In[42]: random.choice(members)

Out[42]: 'Victor'

In[43]: random.choice(members)

Out[43]: 'Cash'

random.shuffle()

Rearranges the items of the sequence passed in into random order, in place

Example:

In[44]: members = ['Cash','Mary','Tom','Victor','Han']

In[45]: random.shuffle(members)

In[46]: members

Out[46]: ['Victor', 'Mary', 'Cash', 'Han', 'Tom']

random.sample()

Returns a new list of the given length made of distinct items picked at random from the sequence passed in

Example:

In[47]: members = ['Cash','Mary','Tom','Victor','Han']

In[48]: random.sample(members,3)

Out[48]: ['Cash', 'Victor', 'Mary']

In[49]: random.sample(members,3)

Out[49]: ['Victor', 'Cash', 'Han']

In[50]: random.sample(members,3)

Out[50]: ['Han', 'Mary', 'Victor']
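When results need to be repeatable, for instance in tests, a seeded generator helps. The sketch below assumes an arbitrary seed of 2016 and uses a private random.Random instance, which offers the same methods as the module-level functions described above.

```python
import random

rng = random.Random(2016)  # arbitrary seed so repeated runs give the same draws

members = ['Cash', 'Mary', 'Tom', 'Victor', 'Han']

picked = rng.choice(members)            # one item
team = rng.sample(members, 3)           # three distinct items, original untouched
die = rng.randint(1, 6)                 # 1..6, both ends included
index = rng.randrange(0, len(members))  # 0..4, stop value excluded

shuffled = list(members)                # copy first: shuffle() works in place
rng.shuffle(shuffled)

print(picked, team, die, index, shuffled)
```

Copying the list before shuffling keeps members in its original order, which matters if other code still relies on it.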

time Module -- Time access and conversions

https://docs.python.org/3/library/time.html#module-time

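time.time()

Returns the current system time as the number of seconds since the epoch (1970-01-01 00:00:00 UTC)

Example:

In[51]: import time

In[52]: time.time()

Out[52]: 1463400683.5634313 # seconds, with a fractional part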

In[53]: time.time()

Out[53]: 1463400690.88585

time.sleep(num)

Pauses execution
num = the number of seconds to pause

Example:

# Count down from ten

In[68]: for i in range(10,-1,-1):

...     print(i)

...     time.sleep(1)

time.localtime()

Returns the local time

The result has the following form:
time.struct_time(tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)

Example 1:

In[54]: time.localtime()

Out[54]: time.struct_time(tm_year=2016, tm_mon=5, tm_mday=16, tm_hour=20, tm_min=12, tm_sec=47, tm_wday=0, tm_yday=137, tm_isdst=0)

Example 2:

In[55]: t = time.localtime()

In[56]: t.tm_hour

Out[56]: 20

In[57]: t.tm_mon

Out[57]: 5

time.gmtime()

Returns the time in UTC

time.strftime()

Formats a time as a string according to a format specification

time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

Example:

In[63]: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

Out[63]: '2016-05-16 20:20:22'
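The pieces above combine naturally: format a struct_time with strftime() and measure elapsed time with two time() calls. A minimal sketch follows; FORMAT is just a local name for the format string already used above, and timestamp 0 is chosen because its UTC representation is fixed.

```python
import time

FORMAT = '%Y-%m-%d %H:%M:%S'

epoch_utc = time.gmtime(0)                  # struct_time for timestamp 0, in UTC
epoch_text = time.strftime(FORMAT, epoch_utc)

start = time.time()
time.sleep(0.1)
elapsed = time.time() - start               # wall-clock seconds spent sleeping

print(epoch_text)
print(round(elapsed, 1))
```

The first line printed is 1970-01-01 00:00:00, the epoch that time.time() counts from; the second varies slightly from run to run.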

sys Module -- System-specific parameters and functions

https://docs.python.org/3/library/sys.html#module-sys

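sys.argv[0]

The path and name of the script being run

sys.argv
The command-line arguments, stored as a list
sys.builtin_module_names

The names of all modules built into the Python interpreter

sys.modules.keys()

The names of the modules that have been loaded so far

sys.platform

An identifier for the current platform (for example 'win32' or 'linux'), not the operating-system version

sys.exit()

Call sys.exit(0) to end the program

sys.version

The version string of the running Python interpreter, including the build number, build date, and compiler

sys.api_version

The C API version of the Python interpreter

sys.version_info

A named tuple with five fields:
(major, minor, micro, releaselevel, serial)

sys.winver

The version number under which this Python is registered on Windows

sys.path

The list of directories Python searches for modules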

shutil Module -- High-level file operations

https://docs.python.org/3/library/shutil.html#module-shutil
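A higher-level layer that provides a number of file-handling operations
shutil = shell + util

shutil.copytree(src, dst)

Copies an entire directory tree

shutil.copy(src, dst)

Copies a file

shutil.rmtree(path)

Deletes an entire directory tree

shutil.move(src, dst)

Moves a file; the file can be renamed in the same step

shutil.copystat(src, dst)

Copies file metadata (permission bits, timestamps) from src to dst without copying the contents; to copy a file together with its metadata, use shutil.copy2(src, dst)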
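A short run-through of the shutil functions in a temporary directory may make the differences clearer. This is a sketch with arbitrary file names, not part of the original note.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    src_dir = os.path.join(root, "src")
    os.mkdir(src_dir)
    with open(os.path.join(src_dir, "a.txt"), "w", encoding="utf-8") as f:
        f.write("data")

    # copy(): duplicate one file.
    shutil.copy(os.path.join(src_dir, "a.txt"), os.path.join(src_dir, "b.txt"))

    # copytree(): duplicate the whole directory; the target must not exist yet.
    backup_dir = os.path.join(root, "backup")
    shutil.copytree(src_dir, backup_dir)

    # move(): relocate and rename in one step.
    shutil.move(os.path.join(backup_dir, "b.txt"), os.path.join(root, "moved.txt"))

    # rmtree(): delete a directory even though it still contains files.
    shutil.rmtree(src_dir)

    top_level = sorted(os.listdir(root))
    backup_files = sorted(os.listdir(backup_dir))

print(top_level, backup_files)
```

Unlike os.rmdir(), shutil.rmtree() does not require the directory to be empty, so it should be used with care.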

itertools.combinations(iterable, r)

Return r length subsequences of elements from the input iterable.

Example:

In[37]: from itertools import combinations

In[38]: for a in combinations('ABCD',2):

...     print(a)

...

Out:

('A', 'B')

('A', 'C')

('A', 'D')

('B', 'C')

('B', 'D')

('C', 'D')

Python_Note10

print()

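print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

The built-in function print() prints its objects arguments. sep sets the separator written between values (default: one space), end sets what is written after the last value (default: '\n', a newline), and file is the output stream (default: sys.stdout, the standard output, which is normally the screen). Pass a different file to send the output elsewhere; the following example writes the given text to data.txt: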

Example:

# -*- coding: UTF-8 -*-

# Writing a file with print()

text = '''食譜：芒果冰沙

1.將愛文芒果去皮，切小塊備料

2.將冰塊放進果汁機，再放入其他所有食材(可留部分芒果塊，放置冰沙上方)

3.開啟果汁機，以漸進式方式快速打勻，直到冰塊完全打成冰沙即可

4.芒果冰沙裝杯後，放入芒果塊於冰沙上方，可增加芒果冰沙的口感喔!'''

with open('data.txt', 'w', encoding='utf-8') as f:
    print(text, file=f)

where

data.txt : the file name
'w' : write, overwriting any existing content. 'a' : append, adding to the end
encoding = 'utf-8' : the text encoding

open()

Use the open() function to write data to a file or read data from one:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Character	Meaning
`'r'`	open for reading (default)
`'w'`	open for writing, truncating the file first
`'x'`	open for exclusive creation, failing if the file already exists
`'a'`	open for writing, appending to the end of the file if it exists
`'b'`	binary mode
`'t'`	text mode (default)
`'+'`	open a disk file for updating (reading and writing)
`'U'`	universal newlines mode (deprecated)

read()
Reads the entire file content in one call. When the file is no longer needed, call close() to release its resources

Example:

# -*- coding: UTF-8 -*-
# Reading a file: read()
file = open('dream.txt', 'r', encoding='UTF-8')
content = file.read()
print(content)
file.close()

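readline()
Reads one line per call. Each call advances the file position, much like popping the first line off a queue: once the first line has been returned, the next call continues from the second line. The file on disk is not modified.

Example:

# -*- coding: UTF-8 -*-
# Reading a file: readline()
file = open('dream.txt', 'r', encoding='UTF-8')
while True:
    line = file.readline()
    if not line:
        break
    print(line, end='')
file.close()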

readlines()
Collects every line of the file into a list

Example:

# -*- coding: UTF-8 -*-
# Reading a file: readlines()
file = open('dream.txt', 'r', encoding='UTF-8')
for line in file.readlines():
    print(line, end='')
file.close()

To write, open the file with mode 'w' or 'a' and call write(). The argument must be a string
Example 1: using write()

# -*- coding: UTF-8 -*-
# Writing a file: write()
# Generate 10 random integers (1 to 1000) and write them to a file
import random

file = open('rand_num.txt', 'w', encoding='UTF-8')
for i in range(10):
    file.write(str(random.randint(1, 1000)) + '\n')
file.close()

Example 2: writing a file directly with print()

# -*- coding: UTF-8 -*-

# Writing a file with print()

text = '''食譜：芒果冰沙

1.將愛文芒果去皮，切小塊備料

2.將冰塊放進果汁機，再放入其他所有食材(可留部分芒果塊，放置冰沙上方)

3.開啟果汁機，以漸進式方式快速打勻，直到冰塊完全打成冰沙即可

4.芒果冰沙裝杯後，放入芒果塊於冰沙上方，可增加芒果冰沙的口感喔!'''

with open('data.txt', 'w', encoding='utf-8') as f:
    print(text, file=f)
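The read and write calls in this note can also be written with the with statement, which closes the file automatically. The following round trip is a sketch with made-up content, run inside a temporary directory so no existing file is overwritten.

```python
import os
import tempfile

lines = ["first line", "second line", "third line"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # 'w' truncates; the with block closes the file even if an error occurs.
    with open(path, "w", encoding="utf-8") as f:
        for text in lines:
            print(text, file=f)

    # 'a' appends to the existing content.
    with open(path, "a", encoding="utf-8") as f:
        f.write("appended line\n")

    # Iterating over the file object yields one line at a time.
    with open(path, "r", encoding="utf-8") as f:
        read_back = [line.rstrip("\n") for line in f]

print(read_back)
```

Iterating over the file object reads line by line like readline(), without loading the whole file into memory the way readlines() does.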

Python_Note08

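Object-oriented programming has three basic characteristics: encapsulation, inheritance, and polymorphism

Encapsulation

Encapsulation means wrapping attributes inside a class. It is tied to another important idea in program design, information hiding: code outside the class should not access its attributes freely. In other words, a class's attributes should be accessed only by that class's own methods.

In Python, the attributes of a class are public by default, so code outside the class can access them too.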

Example:

class Demo:
    x = 0

    def __init__(self, i):
        self.i = i
        Demo.x += 1

    def __str__(self):
        return str(self.i)

    def hello(self):
        print("hello", self.i)

a = Demo(9527)
a.hello()
print("hello", a.i)
print()
print("a.i =", a.i)
print("Demo.x =", Demo.x)

Out:
hello 9527

hello 9527

a.i = 9527

Demo.x = 1

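print("hello", a.i)
print("a.i =", a.i)
print("Demo.x =", Demo.x)

These three statements sit outside the class and read attribute values directly with the dot operator. Sometimes we want proper information hiding, meaning that code outside the class definition should not access its attributes. To get that, make the attribute private by prefixing its identifier with two underscores, e.g. __x. (Python implements this through name mangling: inside class Demo, __x is stored as _Demo__x.)

Code outside the class that needs the value then has to go through a public method.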

Class method

A class method takes a special first parameter, conventionally named cls. It plays the same role as self in an instance method, except that cls is used to access the attributes of the class.

For example:

class Demo:
    __x = 0

    def __init__(self, i):
        self.__i = i
        Demo.__x += 1

    def __str__(self):
        return str(self.__i)

    def hello(self):
        print("hello", self.__i)

    @classmethod
    def getX(cls):  # the class method getX() reads the class attribute __x
        return cls.__x

a = Demo(9527)
a.hello()
print("Demo.__x =", Demo.getX())

# Replacing the last line with print("Demo.__x =", Demo.__x) to read Demo.__x directly
# raises AttributeError: type object 'Demo' has no attribute '__x'

Out:

hello 9527

Demo.__x = 1

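An instance method can only be called after an object has been created,
e.g.
a = Demo('Tom')
a.hello()

A class method needs no object and can be called directly
There is no need to create a = Demo('Tom') first
print(Demo.getX()) works as is
Class methods are also public by default. To define a private class method, one that can only be called from inside the class, prefix its identifier with two underscores in the same way.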
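What "private" means in practice can be checked directly. The sketch below uses a hypothetical Counter class; it shows that the double-underscore name is not reachable from outside under its plain spelling, but that Python's name mangling stores it as _Counter__count, so the protection is a convention rather than a hard barrier.

```python
class Counter:
    __count = 0  # stored on the class as _Counter__count

    def __init__(self):
        Counter.__count += 1

    @classmethod
    def get_count(cls):
        return cls.__count


Counter()
Counter()

try:
    Counter.__count          # outside the class the name is not mangled
    direct_access = "allowed"
except AttributeError:
    direct_access = "blocked"

mangled_value = Counter._Counter__count  # still reachable by the mangled name

print(Counter.get_count(), direct_access, mangled_value)
```

The public class method remains the intended way in; reaching for the mangled name works but defeats the purpose of hiding the attribute.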

Inheritance

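When many classes have been defined and they share a large number of attribute or method definitions, Python's inheritance mechanism lets you pull the shared attributes and methods out into a separately defined parent class (superclass), and turn the classes they were taken from into child classes (subclasses) that inherit from it.

Below, the class subDemo inherits from Demo. Note that the parent class is named in the parentheses after the class name.

Example: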

class Demo:
    __x = 0

    def __init__(self, i):
        self.__i = i
        Demo.__x += 1

    def hello(self):
        print("hello", self.__i)

    @classmethod
    def getX(cls):
        return cls.__x

    @classmethod
    def add(cls):
        Demo.__x += 1

class subDemo(Demo):
    pass

# The subclass subDemo contains only a pass statement,
# so its content is exactly the same as that of the parent class Demo.

a = Demo("Tom")
a.hello()
b = subDemo("John")
b.hello()
print("Demo.x =", Demo.getX())

Out:

hello Tom

hello John

Demo.x = 2

isinstance(object, classinfo)

The built-in function isinstance() tells whether an object is an instance of a given class; it returns True if so and False otherwise.

https://docs.python.org/3/library/functions.html#isinstance
Return true if the object argument is an instance of the classinfo argument, or of a (direct, indirect or virtual) subclass thereof. If object is not an object of the given type, the function always returns false. If classinfo is a tuple of type objects (or recursively, other such tuples), return true if object is an instance of any of the types. If classinfo is not a type or tuple of types and such tuples, a TypeError exception is raised.

issubclass(class, classinfo)

The built-in function issubclass() tells whether one class is a subclass of another; likewise, it returns True if so and False otherwise.

https://docs.python.org/3/library/functions.html#issubclass

Return true if class is a subclass (direct, indirect or virtual) of classinfo. A class is considered a subclass of itself. classinfo may be a tuple of class objects, in which case every entry in classinfo will be checked. In any other case, a TypeError exception is raised.

Example:
Change the end of the previous example to:

a = Demo("Tom")
b = subDemo("John")

# isinstance
print(isinstance(a, Demo))
print(isinstance(a, subDemo))
print(isinstance(b, Demo))
print(isinstance(b, subDemo))

# issubclass
print(issubclass(subDemo, Demo))
print(issubclass(Demo, subDemo))

Out:

True
False
True
True
True
False

Although the variable b was created by subDemo, it is also an instance of Demo: an instance of a subclass counts as an instance of every class it inherits from, which is why b can use the parent class's attributes and methods.
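Two details from the quoted documentation are easy to miss: classinfo may be a tuple, and a class counts as a subclass of itself. The sketch below illustrates both with two hypothetical classes, Animal and Dog.

```python
class Animal:
    pass


class Dog(Animal):
    pass


rex = Dog()

checks = {
    "rex is a Dog": isinstance(rex, Dog),
    "rex is an Animal": isinstance(rex, Animal),
    "rex is int or str": isinstance(rex, (int, str)),   # tuple: any match counts
    "Dog subclasses Animal": issubclass(Dog, Animal),
    "Animal subclasses Dog": issubclass(Animal, Dog),
    "Dog subclasses Dog": issubclass(Dog, Dog),          # a class is its own subclass
}

for label, result in checks.items():
    print(label, "->", result)
```

Passing a tuple is a compact way to accept several types at once, for example isinstance(value, (int, float)).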

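Overriding methods in a subclass

A subclass can define attributes and methods of its own to suit its needs, and it also inherits attributes and methods from its superclass. In general, everything not marked private is inherited, and a subclass can reach the parent's private attributes through the parent's public methods.

A subclass can also override a parent's method when it needs a method of the same name whose behavior is modified, extended, or added to. Defining a method in the subclass with the same name as one in the parent overrides the parent's version, and the overriding method belongs entirely to the subclass.

Example: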

class Demo:
    __x = 0

    def __init__(self, i):
        self.__i = i
        Demo.__x += 1

    def __str__(self):
        return str(self.__i)

    def hello(self):
        print("hello " + self.__str__())

    @classmethod
    def getX(cls):
        return cls.__x

class SubDemo(Demo):
    def __init__(self, i, j):
        self.__i = i
        self.__j = j

    def __str__(self):
        return str(self.__i) + "+" + str(self.__j)

a = SubDemo(12, 34)
a.hello()
print("a.__x =", a.getX())
b = SubDemo(56, 78)
b.hello()
print("b.__x =", b.getX())
print()
print("a.__x =", a.getX())
print("b.__x =", b.getX())

Out:

hello 12+34

a.__x = 0

hello 56+78

b.__x = 0

a.__x = 0

b.__x = 0

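Demo is the parent class and defines four methods. SubDemo is the subclass and overrides two of them, __init__() and __str__().

Notice that Demo has a variable __x that was meant to count how many Demo instances have been created. That count should arguably include instances of the subclass SubDemo; that is, __x should equal the number of Demo instances plus the number of SubDemo instances. Here we created two SubDemo instances, yet __x is 0.

The reason is that Demo's __init__() is never called in this example, so __x stays at its initial value of 0. One simple fix is to add a __x to the subclass SubDemo, but that would only count SubDemo instances; any other subclass of Demo would be left out of the total.

The other fix is to call the parent's method first from within the overriding method in the subclass, using the built-in function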

super()

class Demo:
    __x = 0

    def __init__(self, i):
        self.__i = i
        Demo.__x += 1

    def __str__(self):
        return str(self.__i)

    def hello(self):
        print("hello " + self.__str__())

    @classmethod
    def getX(cls):
        return cls.__x

class SubDemo(Demo):
    def __init__(self, i, j):
        # Call the parent class Demo's __init__() through super(),
        # so i must be passed along as an argument.
        # Note that self.__i is now a private attribute of the parent class.
        super().__init__(i)
        self.__j = j

    def __str__(self):
        # self.__i is private to the parent class Demo, so the expression
        # after return first calls the parent's __str__() through super().
        return super().__str__() + "+" + str(self.__j)

a = SubDemo(12, 34)
a.hello()
print("a.__x =", a.getX())
b = SubDemo(56, 78)
b.hello()
print("b.__x =", b.getX())
print()
print("a.__x =", a.getX())
print("b.__x =", b.getX())

Out:

hello 12+34

a.__x = 1

hello 56+78

b.__x = 2

a.__x = 2

b.__x = 2

Multiple Inheritance

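A class can have more than one parent class: a subclass can inherit from several superclasses (multiple inheritance) and thereby combine several sets of behavior.

One point to note: when a subclass inherits from more than one source, the parent written leftmost takes priority. If several parents define an attribute or method of the same name, such as __init__() or __str__(), the leftmost parent's version is the one used. The syntax is:

class SubClass(SuperClass1, SuperClass2):
    pass

So if SuperClass1 defines __init__(), __str__(), and so on, SubClass gets SuperClass1's __init__() and __str__(), and SuperClass2's versions of those methods are not the ones SubClass uses.
Example: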

# Multiple inheritance
class Demo:
    __x = 0

    def __init__(self, i):
        self.__i = i
        Demo.__x += 1

    def hello(self):
        print("hello", self.__i)

    @classmethod
    def getX(cls):
        return cls.__x

    @classmethod
    def add(cls):
        Demo.__x += 1

class Demo2:
    def __init__(self, i):
        self.__i = i

    def reverseString(self, string):
        reverse = ''
        for i in range(len(string) - 1, -1, -1):
            reverse += string[i]
        return reverse

class subDemo(Demo, Demo2):
    def __init__(self, i, j="guest"):
        super().__init__(i)
        self.__i = i
        self.__j = j

    def hello(self):
        print("hello", self.__i, self.__j)

    def superHello(self):
        super().__init__(self.__i)
        super().hello()

a = subDemo("Tom")
print(a.reverseString("Tom"))
print("Demo.x =", Demo.getX())

Out:

moT

Demo.x = 1
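The "leftmost parent wins" rule can be seen in isolation with three small classes. Left, Right, and Child below are hypothetical names for this sketch; the lookup order Python follows is exposed as the class's __mro__ attribute.

```python
class Left:
    def who(self):
        return "Left"


class Right:
    def who(self):
        return "Right"

    def only_right(self):
        return "only in Right"


class Child(Left, Right):
    pass


c = Child()
order = [cls.__name__ for cls in Child.__mro__]

print(c.who())          # Left wins because it is listed first
print(c.only_right())   # still inherited from Right
print(order)
```

Methods that exist only in the second parent are inherited as usual; the ordering only matters when names collide.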

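__del__()

A constructor creates an object. When an object is no longer in use, the interpreter calls its __del__() method on its behalf as part of destroying it and releasing its memory. This method is known as the destructor (Python's documentation calls it a finalizer), and of course we can override it.

Example: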

class Demo:
    def __init__(self, i):
        self.i = i

    def __str__(self):  # overridden
        return 'hello world'

    def __del__(self):
        print("del called: " + self.__str__())

    def hello(self):
        print("hello " + self.__str__())

a = Demo("Tommy")
a.hello()
print(str(a))

Out:

hello hello world

hello world

del called: hello world

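Only one name, the variable a, is used here: the constructor Demo() creates the object, hello() is called on it, and its string form is printed. Just before the program exits, the interpreter calls the destructor of that object on its own, which prints the "del called" message, and the memory it used is released. If a were rebound to a second Demo object, CPython would call __del__() on the first one at that moment, since nothing would refer to it any longer.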
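The timing of __del__() can be made visible by recording the calls. The sketch below assumes CPython, where an object is finalized as soon as its last reference disappears; other interpreters may delay the call. Resource and events are names invented for this illustration.

```python
events = []


class Resource:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        events.append("del " + self.name)


r = Resource("first")
r = Resource("second")   # "first" loses its last reference here
del r                    # so does "second"

print(events)
```

On CPython this prints ['del first', 'del second']. Because the timing is interpreter-dependent, cleanup that must happen at a known point is usually better expressed with a with statement or an explicit close() method.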

Polymorphism

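Polymorphism is a key feature of object-oriented programming languages that makes working with objects more flexible. Put simply, polymorphism lets objects of different types be used through a common interface. For example, the output of one such program: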

Out:

Missy: Meow!

Garfield: Meow!

Lassie: Woof! Woof!
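The listing that produced this output does not appear above. A minimal sketch that prints the same three lines is given here, assuming an Animal base class with Cat and Dog subclasses that each supply their own talk(); these class and method names are assumptions, not taken from the note.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def talk(self):
        raise NotImplementedError("subclasses must implement talk()")


class Cat(Animal):
    def talk(self):
        return "Meow!"


class Dog(Animal):
    def talk(self):
        return "Woof! Woof!"


animals = [Cat("Missy"), Cat("Garfield"), Dog("Lassie")]

# The loop never checks the concrete type: each object answers talk() itself.
lines = [animal.name + ": " + animal.talk() for animal in animals]
for line in lines:
    print(line)
```

The calling code treats every element as "something that can talk"; which talk() runs is decided by the object's own class.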

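Another example:

d1 = '1,2,3,4,5'      # str.count() only accepts a str argument
d2 = [1,2,3,4,'5']    # list.count() accepts an object of any type; which count() runs depends on the type it is called on

print(d1.count('4'))
print(d2.count('4'))

Out:

1
0

d1 is a string and d2 is a list. Both are sequence types (compound data types) and share a count() method that returns how many times a given element occurs.

Polymorphism shows up in many places: a list can hold elements of different types, for instance, and a method can accept arguments of different types.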

Python_Note07

Class

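A class is a template for objects. Everything in Python is an object, and every object has attributes and methods. An attribute is like a variable that belongs to the object; a method is like a function that likewise belongs to the object.

A class is defined with the keyword class, in the following form

Example:

By convention a method's first parameter is the identifier self, which stands for the object itself. You can give the first parameter a different name, but self is the customary choice.

The built-in function help() displays information about an object (classes included), and dir() returns a list of all the attributes and methods of an object (or class)

Creating an object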
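As a minimal sketch of the points above (defining a class, creating an object, and inspecting it with dir()), consider the following. Greeter and its members are hypothetical names chosen for this illustration.

```python
class Greeter:
    """A tiny class with one attribute and one method."""

    def __init__(self, name):
        self.name = name            # attribute: data that belongs to the object

    def greet(self):                # method: a function that belongs to the object
        return "hello " + self.name


g = Greeter("Python")               # calling the class creates an object
members = dir(g)                    # names of all attributes and methods

print(g.greet())
print("name" in members, "greet" in members)
```

dir() also lists the many double-underscore names every object inherits, which is why the sketch checks for specific names instead of printing the whole list.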

Special Methods

https://docs.python.org/3/reference/datamodel.html#special-method-names
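Methods in a class whose names begin and end with a double underscore are called special methods. They implement particular operations on the class; the __init__ method is one example.

__init__()

__init__(self) plays the role of a constructor and is where initial values are set.
An object built by calling the class is called an instance, and the __init__() method runs as part of creating it. A user-defined class comes with a predefined __init__(), which we can override to suit our needs.
Overriding simply means defining the method again. A method definition looks like a function definition; both use the keyword def.

Example: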

class Demo:
    def __init__(self):
        self.name = 'Python'  # self.name is like this.name in other languages

    def hello(self):
        print('hello', self.name)

d = Demo()
print(type(Demo))
print(type(d))
print('d.name : %s' % d.name)
d.hello()

Out:

<class 'type'>
<class '__main__.Demo'>
d.name : Python
hello Python

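self

Every instance method needs a special first parameter, self, which refers to the instance itself. Instance attributes defined in __init__() must be prefixed with self, as with self.name in the example above; self.name reaches the attribute on the object itself. self is a naming convention rather than a keyword or reserved word, so a different name could be used.

For why self is not a keyword, see the explanation by Python's creator, Guido van Rossum
http://neopythonic.blogspot.tw/2008/10/why-explicit-self-has-to-stay.html

Parameters of __init__()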

class Demo:
    def __init__(self, name):  # name: the parameter
        self.name = name

    def hello(self):
        print('hello', self.name)

d = Demo('Tom')  # 'Tom': the argument
print(type(Demo))
print(type(d))
print('d.name : %s' % d.name)
d.hello()

Out:

<class 'type'>
<class '__main__.Demo'>
d.name : Tom
hello Tom

__doc__

A class has a __doc__ attribute: the text of the triple-quoted string at the top of the class body, which serves as the class's documentation.

Example:

class Demo:
    '''
    Demo Document:
    hello python
    '''

    def __init__(self, name):
        self.name = name

    def hello(self):
        print("hello", self.name)

d = Demo("Tom")
print(d.__doc__)

Out:

Demo Document:

hello python

Attribute

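A Python class has two kinds of attributes: class attributes and instance attributes

Example:

Class attributes are normally accessed through the class name, but when the two identifiers differ, an instance can read the class attribute as well. If a class attribute and an instance attribute share the same identifier, the instance can reach only the instance attribute.

Example (class attribute and instance attribute with different identifiers):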

class Demo:
    x = 0

    def __init__(self, i):
        self.i = i
        Demo.x += 1

    def __str__(self):
        return str(self.i)

    def hello(self):
        print("hello", self.i)

print("There are", Demo.x, "instances.")
a = Demo(1122)
a.hello()
print("a.x =", a.x)
b = Demo(3344)
b.hello()
print("b.x =", b.x)
c = Demo(5566)
c.hello()
print("c.x =", c.x)
d = Demo(7788)
d.hello()
print("d.x =", d.x)
e = Demo(9900)
e.hello()
print("e.x =", e.x)
print("After all, there are", Demo.x, "instances.")

Out:

There are 0 instances.

hello 1122

a.x = 1

hello 3344

b.x = 2

hello 5566

c.x = 3

hello 7788

d.x = 4

hello 9900

e.x = 5

After all, there are 5 instances.

Example (class attribute and instance attribute with the same identifier):

class Demo:
    i = 0

    def __init__(self, i):
        self.i = i
        Demo.i += 1

    def __str__(self):
        return str(self.i)

    def hello(self):
        print("hello", self.i)

print("There are", Demo.i, "instances.")
a = Demo(1122)
a.hello()
print("a.i =", a.i)
b = Demo(3344)
b.hello()
print("b.i =", b.i)
c = Demo(5566)
c.hello()
print("c.i =", c.i)
d = Demo(7788)
d.hello()
print("d.i =", d.i)
e = Demo(9900)
e.hello()
print("e.i =", e.i)
print("After all, there are", Demo.i, "instances.")

Out:

There are 0 instances.

hello 1122

a.i = 1122

hello 3344

b.i = 3344

hello 5566

c.i = 5566

hello 7788

d.i = 7788

hello 9900

e.i = 9900

After all, there are 5 instances.
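The lookup rule behind both examples is that an instance falls back to the class attribute only while it has no attribute of its own with that name. The sketch below shows the moment the fallback stops; count and label are hypothetical attribute names used only here.

```python
class Demo:
    count = 0                     # class attribute, shared by all instances

    def __init__(self, label):
        self.label = label        # instance attribute, one per object
        Demo.count += 1


a = Demo("a")
b = Demo("b")

shared = (a.count, b.count, Demo.count)   # instances fall back to the class

a.count = 99                      # creates an instance attribute on a only
after = (a.count, b.count, Demo.count)

print(shared, after)
```

Assigning through the instance never changes the class attribute; it adds a new attribute to that one object, which then hides the class-level value for that object alone.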
