
            Chapter 9. XML Processing

            9.1. Overview

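            The next two chapters are about XML processing in Python. It helps if you already have a rough idea of what an XML document looks like: that it is made up of structured tags, that those tags form a hierarchy of elements, and so on. If none of that makes sense to you, there are many XML tutorials that explain the basics.

            If you are not particularly interested in XML, you should still read these chapters. They cover several important topics, such as Python packages, Unicode, command-line arguments, and how to use getattr for method dispatching.

            If you majored in philosophy in college (rather than something practical, like computer science) and had the misfortune of being tormented by the writings of Immanuel Kant, you will appreciate this chapter's example program a great deal. (This does not mean, of course, that you need to have studied philosophy.)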

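            There are two basic ways to work with XML. One is called SAX (“Simple API for XML”). It works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that is how sgmllib works.) The other is called DOM (“Document Object Model”). It works by reading in the entire XML document at once and building an internal representation of it out of Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter deals only with the DOM.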
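To make the contrast concrete, here is a minimal sketch of both styles applied to the same input. It is written for Python 3 (the listing in this chapter is Python 2), and the sample XML string and the TagLogger class are invented for illustration; only xml.sax and xml.dom.minidom come from the standard library.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
XML = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"


class TagLogger(xml.sax.ContentHandler):
    """SAX style: the parser calls startElement for each opening tag it reads."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# SAX: stream through the document; the handler sees tags in document order.
handler = TagLogger()
xml.sax.parseString(XML.encode("utf-8"), handler)
sax_tags = handler.tags

# DOM: parse the whole document into a tree first, then query the tree.
doc = minidom.parseString(XML)
dom_tags = [p.tagName for p in doc.getElementsByTagName("p")]
ref_id = doc.getElementsByTagName("ref")[0].attributes["id"].value

print(sax_tags)
print(dom_tags, ref_id)
```

With SAX you only ever see one event at a time, so any state has to be kept by the handler itself. With the DOM the whole tree stays in memory, which is what lets a program jump around in it freely.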

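            The following is a complete Python program that generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry if you don't understand what that means yet; the next two chapters examine both the program's input and its output in depth.

            Example 9.1. kgp.py

            If you have not already done so, you can download this and the other example programs used in this book.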

            """Kant Generator for Python
            
            Generates mock philosophy based on a context-free grammar
            
            Usage: python kgp.py [options] [source]
            
            Options:
              -g ..., --grammar=...   use specified grammar file or URL
              -h, --help              show this help
              -d                      show debugging information while parsing
            
            Examples:
              kgp.py                  generates several paragraphs of Kantian philosophy
              kgp.py -g husserl.xml   generates several paragraphs of Husserl
              kgp.py "<xref id='paragraph'/>"  generates a paragraph of Kant
              kgp.py template.xml     reads from template.xml to decide what to generate
            """
            from xml.dom import minidom
            import random
            import toolbox
            import sys
            import getopt
            
            _debug = 0
            
            class NoSourceError(Exception): pass
            
            class KantGenerator:
                """generates mock philosophy based on a context-free grammar"""
            
                def __init__(self, grammar, source=None):
                    self.loadGrammar(grammar)
                    self.loadSource(source and source or self.getDefaultSource())
                    self.refresh()
            
                def _load(self, source):
                    """load XML input source, return parsed XML document
            
                    - a URL of a remote XML file ("http://diveintopython.org/kant.xml")
                    - a filename of a local XML file ("~/diveintopython/common/py/kant.xml")
                    - standard input ("-")
                    - the actual XML document, as a string
                    """
                    sock = toolbox.openAnything(source)
                    xmldoc = minidom.parse(sock).documentElement
                    sock.close()
                    return xmldoc
            
                def loadGrammar(self, grammar):                         
                    """load context-free grammar"""                     
                    self.grammar = self._load(grammar)                  
                    self.refs = {}                                      
                    for ref in self.grammar.getElementsByTagName("ref"):
                        self.refs[ref.attributes["id"].value] = ref     
            
                def loadSource(self, source):
                    """load source"""
                    self.source = self._load(source)
            
                def getDefaultSource(self):
                    """guess default source of the current grammar
                    
                    The default source will be one of the <ref>s that is not
                    cross-referenced.  This sounds complicated but it's not.
                    Example: The default source for kant.xml is
                    "<xref id='section'/>", because 'section' is the one <ref>
                    that is not <xref>'d anywhere in the grammar.
                    In most grammars, the default source will produce the
                    longest (and most interesting) output.
                    """
                    xrefs = {}
                    for xref in self.grammar.getElementsByTagName("xref"):
                        xrefs[xref.attributes["id"].value] = 1
                    xrefs = xrefs.keys()
                    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
                    if not standaloneXrefs:
                        raise NoSourceError, "can't guess source, and no source specified"
                    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
                    
                def reset(self):
                    """reset parser"""
                    self.pieces = []
                    self.capitalizeNextWord = 0
            
                def refresh(self):
                    """reset output buffer, re-parse entire source file, and return output
                    
                    Since parsing involves a good deal of randomness, this is an
                    easy way to get new output without having to reload a grammar file
                    each time.
                    """
                    self.reset()
                    self.parse(self.source)
                    return self.output()
            
                def output(self):
                    """output generated text"""
                    return "".join(self.pieces)
            
                def randomChildElement(self, node):
                    """choose a random child element of a node
                    
                    This is a utility method used by do_xref and do_choice.
                    """
                    choices = [e for e in node.childNodes
                               if e.nodeType == e.ELEMENT_NODE]
                    chosen = random.choice(choices)            
                    if _debug:                                 
                        sys.stderr.write('%s available choices: %s\n' % \
                            (len(choices), [e.toxml() for e in choices]))
                        sys.stderr.write('Chosen: %s\n' % chosen.toxml())
                    return chosen                              
            
                def parse(self, node):         
                    """parse a single XML node
                    
                    A parsed XML document (from minidom.parse) is a tree of nodes
                    of various types.  Each node is represented by an instance of the
                    corresponding Python class (Element for a tag, Text for
                    text data, Document for the top-level document).  The following
                    statement constructs the name of a class method based on the type
                    of node we're parsing ("parse_Element" for an Element node,
                    "parse_Text" for a Text node, etc.) and then calls the method.
                    """
                    parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
                    parseMethod(node)
            
                def parse_Document(self, node):
                    """parse the document node
                    
                    The document node by itself isn't interesting (to us), but
                    its only child, node.documentElement, is: it's the root node
                    of the grammar.
                    """
                    self.parse(node.documentElement)
            
                def parse_Text(self, node):    
                    """parse a text node
                    
                    The text of a text node is usually added to the output buffer
                    verbatim.  The one exception is that <p class='sentence'> sets
                    a flag to capitalize the first letter of the next word.  If
                    that flag is set, we capitalize the text and reset the flag.
                    """
                    text = node.data
                    if self.capitalizeNextWord:
                        self.pieces.append(text[0].upper())
                        self.pieces.append(text[1:])
                        self.capitalizeNextWord = 0
                    else:
                        self.pieces.append(text)
            
                def parse_Element(self, node): 
                    """parse an element
                    
                    An XML element corresponds to an actual tag in the source:
                    <xref id='...'>, <p chance='...'>, <choice>, etc.
                    Each element type is handled in its own method.  Like we did in
                    parse(), we construct a method name based on the name of the
                    element ("do_xref" for an <xref> tag, etc.) and
                    call the method.
                    """
                    handlerMethod = getattr(self, "do_%s" % node.tagName)
                    handlerMethod(node)
            
                def parse_Comment(self, node):
                    """parse a comment
                    
                    The grammar can contain XML comments, but we ignore them
                    """
                    pass
                
                def do_xref(self, node):
                    """handle <xref id='...'> tag
                    
                    An <xref id='...'> tag is a cross-reference to a <ref id='...'>
                    tag.  <xref id='sentence'/> evaluates to a randomly chosen child of
                    <ref id='sentence'>.
                    """
                    id = node.attributes["id"].value
                    self.parse(self.randomChildElement(self.refs[id]))
            
                def do_p(self, node):
                    """handle <p> tag
                    
                    The <p> tag is the core of the grammar.  It can contain almost
                    anything: freeform text, <choice> tags, <xref> tags, even other
                    <p> tags.  If a "class='sentence'" attribute is found, a flag
                    is set and the next word will be capitalized.  If a "chance='X'"
                    attribute is found, there is an X% chance that the tag will be
                    evaluated (and therefore a (100-X)% chance that it will be
                    completely ignored)
                    """
                    keys = node.attributes.keys()
                    if "class" in keys:
                        if node.attributes["class"].value == "sentence":
                            self.capitalizeNextWord = 1
                    if "chance" in keys:
                        chance = int(node.attributes["chance"].value)
                        doit = (chance > random.randrange(100))
                    else:
                        doit = 1
                    if doit:
                        for child in node.childNodes: self.parse(child)
            
                def do_choice(self, node):
                    """handle <choice> tag
                    
                    A <choice> tag contains one or more <p> tags.  One <p> tag
                    is chosen at random and evaluated; the rest are ignored.
                    """
                    self.parse(self.randomChildElement(node))
            
            def usage():
                print __doc__
            
            def main(argv):                         
                grammar = "kant.xml"                
                try:                                
                    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
                except getopt.GetoptError:          
                    usage()                         
                    sys.exit(2)                     
                for opt, arg in opts:               
                    if opt in ("-h", "--help"):     
                        usage()                     
                        sys.exit()                  
                    elif opt == '-d':               
                        global _debug               
                        _debug = 1                  
                    elif opt in ("-g", "--grammar"):
                        grammar = arg               
                
                source = "".join(args)              
            
                k = KantGenerator(grammar, source)
                print k.output()
            
            if __name__ == "__main__":
                main(sys.argv[1:])
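
The parse and parse_Element methods above both pick a handler by building a method name as a string and looking it up with getattr. The following is a minimal sketch of that dispatch idea on its own, written for Python 3. NodeDescriber and its method names are invented for illustration and are not part of kgp.py. Unlike kgp.py, it passes a default to getattr so that unknown node types fall through to a catch-all handler instead of raising AttributeError.

```python
from xml.dom import minidom


class NodeDescriber:
    """Hypothetical class, invented for this sketch; not part of kgp.py."""

    def describe(self, node):
        # Build the method name from the node's class name, then look it up.
        # The third argument to getattr is the fallback if no such method exists.
        method = getattr(self, "describe_%s" % node.__class__.__name__,
                         self.describe_default)
        return method(node)

    def describe_Element(self, node):
        return "element <%s>" % node.tagName

    def describe_Text(self, node):
        return "text %r" % node.data

    def describe_default(self, node):
        return "unhandled %s" % node.__class__.__name__


doc = minidom.parseString("<p>hello<!-- note --><b/></p>")
describer = NodeDescriber()
results = [describer.describe(child)
           for child in doc.documentElement.childNodes]
print(results)
```

Supporting a new node type then only requires adding a method with the matching name; the dispatching code itself does not change.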
            

            Example 9.2. toolbox.py

            """Miscellaneous utility functions"""
            
            def openAnything(source):            
                """URI, filename, or string --> stream
            
                This function lets you define parsers that take any input source
                (URL, pathname to local or network file, or actual data as a string)
                and deal with it in a uniform manner.  Returned object is guaranteed
                to have all the basic stdio read methods (read, readline, readlines).
                Just .close() the object when you're done with it.
                
                Examples:
                >>> from xml.dom import minidom
                >>> sock = openAnything("http://localhost/kant.xml")
                >>> doc = minidom.parse(sock)
                >>> sock.close()
                >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
                >>> doc = minidom.parse(sock)
                >>> sock.close()
                >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
                >>> doc = minidom.parse(sock)
                >>> sock.close()
                """
                if hasattr(source, "read"):
                    return source
            
                if source == '-':
                    import sys
                    return sys.stdin
            
                # try to open with urllib (if source is http, ftp, or file URL)
                import urllib                         
                try:                                  
                    return urllib.urlopen(source)     
                except (IOError, OSError):            
                    pass                              
                
                # try to open with native open function (if source is pathname)
                try:                                  
                    return open(source)               
                except (IOError, OSError):            
                    pass                              
                
                # treat source as string
                import StringIO                       
                return StringIO.StringIO(str(source)) 
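
The listing above is Python 2 (urllib.urlopen and the StringIO module). As a rough sketch only, the same fallback chain might look like this in Python 3. The name open_anything is an assumption made for this sketch, and one behavior differs: urllib.request.urlopen raises ValueError for a string that is not a URL, so that exception is caught as well.

```python
import io
import sys
import urllib.request
from xml.dom import minidom


def open_anything(source):
    """Return a readable stream for a stream, "-", a URL, a path, or raw text."""
    # Already a file-like object: hand it back unchanged.
    if hasattr(source, "read"):
        return source

    if source == "-":
        return sys.stdin

    # Try the source as a URL; a string that is not a URL raises ValueError here.
    try:
        return urllib.request.urlopen(source)
    except (ValueError, OSError):
        pass

    # Try the source as a local pathname.
    try:
        return open(source)
    except (ValueError, OSError):
        pass

    # Fall back to treating the source as the data itself.
    return io.StringIO(str(source))


sock = open_anything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
doc = minidom.parse(sock)
sock.close()
words = [t.firstChild.data for t in doc.getElementsByTagName("text")]
print(words)
```

The order of the attempts matters: the most specific interpretation (an open stream) is tried first and the most permissive one (raw text) last, so a literal XML string is only treated as data once it has failed to open as a URL or a file.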
            

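            Run kgp.py by itself, and it parses the default XML-based grammar in kant.xml and prints several paragraphs' worth of philosophy in the style of Immanuel Kant.

            Example 9.3. Sample output of kgp.py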

            [you@localhost kgp]$ python kgp.py
                 As is shown in the writings of Hume, our a priori concepts, in
            reference to ends, abstract from all content of knowledge; in the study
            of space, the discipline of human reason, in accordance with the
            principles of philosophy, is the clue to the discovery of the
            Transcendental Deduction.  The transcendental aesthetic, in all
            theoretical sciences, occupies part of the sphere of human reason
            concerning the existence of our ideas in general; still, the
            never-ending regress in the series of empirical conditions constitutes
            the whole content for the transcendental unity of apperception.  What
            we have alone been able to show is that, even as this relates to the
            architectonic of human reason, the Ideal may not contradict itself, but
            it is still possible that it may be in contradictions with the
            employment of the pure employment of our hypothetical judgements, but
            natural causes (and I assert that this is the case) prove the validity
            of the discipline of pure reason.  As we have already seen, time (and
            it is obvious that this is true) proves the validity of time, and the
            architectonic of human reason, in the full sense of these terms,
            abstracts from all content of knowledge.  I assert, in the case of the
            discipline of practical reason, that the Antinomies are just as
            necessary as natural causes, since knowledge of the phenomena is a
            posteriori.
                The discipline of human reason, as I have elsewhere shown, is by
            its very nature contradictory, but our ideas exclude the possibility of
            the Antinomies.  We can deduce that, on the contrary, the pure
            employment of philosophy, on the contrary, is by its very nature
            contradictory, but our sense perceptions are a representation of, in
            the case of space, metaphysics.  The thing in itself is a
            representation of philosophy.  Applied logic is the clue to the
            discovery of natural causes.  However, what we have alone been able to
            show is that our ideas, in other words, should only be used as a canon
            for the Ideal, because of our necessary ignorance of the conditions.
            
            [...snip...]

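            This is, of course, gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose; Kant was not what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing Kant would have agreed with), some of it is plainly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.

            Let me repeat: this is much, much funnier if you are now, or have ever been, a philosophy major.

            The interesting thing is that nothing in this program is specific to Kant. All of the content comes from the context-free grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different.

            Example 9.4. Simpler output from kgp.py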

            [you@localhost kgp]$ python kgp.py -g binary.xml
            00101001
            [you@localhost kgp]$ python kgp.py -g binary.xml
            10110100

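            Later in this chapter you will take a closer look at the structure of the grammar file. For now, all you need to know is that the grammar file defines the structure of the output, and that kgp.py reads through the grammar rules and decides at random which words to plug in where.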
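
As a rough illustration of that division of labor, here is a small Python 3 sketch. The grammar in it is an assumption made for this example: it borrows the ref, xref and p tag names that kgp.py handles and produces eight binary digits, but the actual contents of binary.xml are not shown in this section. The expand function is likewise a simplified stand-in, not the KantGenerator class.

```python
import random
from xml.dom import minidom

# Hypothetical grammar, invented for this sketch. A "bit" is either 0 or 1;
# a "byte" is eight cross-references to "bit".
GRAMMAR = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p>%s</p></ref>
</grammar>""" % ('<xref id="bit"/>' * 8)

doc = minidom.parseString(GRAMMAR)
refs = {ref.attributes["id"].value: ref
        for ref in doc.getElementsByTagName("ref")}


def expand(node):
    """Return the text generated by one node of the grammar."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.nodeType != node.ELEMENT_NODE:
        return ""
    if node.tagName == "xref":
        # Follow the cross-reference and pick one alternative at random.
        target = refs[node.attributes["id"].value]
        choices = [c for c in target.childNodes
                   if c.nodeType == c.ELEMENT_NODE]
        return expand(random.choice(choices))
    # <p> and anything else: concatenate whatever the children produce.
    return "".join(expand(child) for child in node.childNodes)


start = minidom.parseString('<xref id="byte"/>').documentElement
result = expand(start)
print(result)
```

All of the vocabulary lives in the grammar string; the code only knows how to follow cross-references and choose among alternatives, which is why swapping the grammar changes the output completely.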
