Web Technology: DOM & SAX & XML

Document Object Model

The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Objects in the DOM tree may be addressed and manipulated by using methods on the objects. The public interface of a DOM is specified in its application programming interface (API). The history of the Document Object Model is intertwined with the history of the "browser wars" of the late 1990s between Netscape Navigator and Microsoft Internet Explorer, as well as with that of JavaScript and JScript, the first scripting languages to be widely implemented in the layout engines of web browsers.

History:

Legacy DOM

JavaScript was released by Netscape Communications in 1996 within Netscape Navigator 2.0. Netscape's competitor, Microsoft, released Internet Explorer 3.0 later the same year with a port of JavaScript called JScript. JavaScript and JScript let web developers create web pages with client-side interactivity. The limited facilities for detecting user-generated events and modifying the HTML document in the first generation of these languages eventually became known as "DOM Level 0" or "Legacy DOM." No independent standard was developed for DOM Level 0, but it was partly described in the specification of HTML 4.

Legacy DOM was limited in the kinds of elements that could be accessed. Form, link and image elements could be referenced with a hierarchical name that began with the root document object. A hierarchical name could use either the names or the sequential indexes of the traversed elements. For example, a form input element could be accessed as either "document.formName.inputName" or "document.forms[0].elements[0]".

The Legacy DOM enabled client-side form validation and the popular "rollover" effect.

Intermediate DOM

In 1997, Netscape and Microsoft released version 4.0 of Netscape Navigator and Internet Explorer respectively, adding support for Dynamic HTML (DHTML), functionality enabling changes to a loaded HTML document. DHTML required extensions to the rudimentary document object that was available in the Legacy DOM implementations. Although the Legacy DOM implementations were largely compatible since JScript was based on JavaScript, the DHTML DOM extensions were developed in parallel by each browser maker and remained incompatible. These versions of the DOM became known as the "Intermediate DOM."

Standardization

The World Wide Web Consortium (W3C) was founded in 1994 to promote open standards for the World Wide Web. The browser scripting language itself was standardized separately: Netscape Communications and Microsoft, together with other companies, worked through the standards body Ecma International on a standard called "ECMAScript." The first version of the standard was published in 1997. Subsequent releases of JavaScript and JScript would implement the ECMAScript standard for greater cross-browser compatibility.

After the release of ECMAScript, W3C began work on a standardized DOM. The initial DOM standard, known as "DOM Level 1," was recommended by W3C in late 1998. About the same time, Internet Explorer 5.0 shipped with limited support for DOM Level 1. DOM Level 1 provided a complete model for an entire HTML or XML document, including means to change any portion of the document. Non-conformant browsers such as Internet Explorer 4.x and Netscape 4.x were still widely used as late as 2000. DOM Level 2 was published in late 2000. It introduced the "getElementById" function as well as an event model and support for XML namespaces and CSS. DOM Level 3, the current release of the DOM specification, published in April 2004, added support for XPath and keyboard event handling, as well as an interface for serializing documents as XML.

DOM Level 4 is currently being developed. Draft version 6 was released in December 2012.

By 2005, large parts of W3C DOM were well-supported by common ECMAScript-enabled browsers, including Microsoft Internet Explorer version 6 (from 2001), Opera, Safari and Gecko-based browsers (like Mozilla, Firefox, SeaMonkey and Camino).

Applications

Web browsers:

To render a document such as an HTML page, most web browsers use an internal model similar to the DOM. The nodes of every document are organized in a tree structure, called the DOM tree, whose topmost node is named the "Document object". When a browser renders an HTML page, it downloads the HTML into local memory and automatically parses it to display the page on screen. The DOM is also the interface through which JavaScript reads and changes the state of the page shown in the browser.

Implementations

Because DOM supports navigation in any direction (e.g., parent and previous sibling) and allows for arbitrary modifications, an implementation must at least buffer the document that has been read so far (or some parsed form of it).
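As a rough illustration of navigation in any direction and in-place modification, here is a minimal sketch using Python's standard xml.dom.minidom module. The sample document and the variable names are made up for this example; it shows one possible DOM implementation, not the only way to work with one.

```python
from xml.dom.minidom import parseString

# A small hypothetical document, used only for illustration.
doc = parseString("<list><item>a</item><item>b</item><item>c</item></list>")

root = doc.documentElement
items = root.getElementsByTagName("item")
last = items[2]

# Navigate backwards (previous sibling) and upwards (parent) from any node.
previous_text = last.previousSibling.firstChild.data
parent_name = last.parentNode.tagName

# Modify the tree in place: append a new element to the root.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("d"))
root.appendChild(new_item)

# Serialize the modified tree back to XML text.
serialized = root.toxml()
print(previous_text, parent_name, serialized)
```

Because the whole tree is held in memory, the lookups and the modification above can happen in any order.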

Layout engines

Web browsers rely on layout engines to parse HTML into a DOM. Some layout engines, such as Trident/MSHTML and Presto, are associated primarily or exclusively with a particular browser (Internet Explorer and Opera, respectively). Others, such as WebKit and Gecko, are shared by a number of browsers, such as Google Chrome, Firefox and Safari. The different layout engines implement the DOM standards to varying degrees of compliance.

Simple API for XML

SAX (Simple API for XML) is an event-based sequential access parser API developed by the XML-DEV mailing list for XML documents. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole, SAX parsers operate on each piece of the XML document sequentially.

Definition

Unlike DOM, there is no formal specification for SAX. The Java implementation of SAX is considered to be normative. SAX processes documents state-dependently, in contrast to DOM, which is used for state-independent processing of XML documents.

Benefits

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

This much memory is usually considered negligible. A DOM parser, in contrast, typically builds a tree representation of the entire document in memory to begin with, thus using memory that increases with the entire document length. This takes considerable time and space for large documents (memory allocation and data-structure construction take time). The compensating advantage, of course, is that once loaded any part of the document can be accessed in any order.
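The memory bound described above can be sketched with Python's standard xml.sax module. The handler below (DepthHandler is a hypothetical name chosen for this example) keeps nothing but the stack of currently open elements, so the state it holds grows with nesting depth rather than with document length.

```python
import xml.sax


class DepthHandler(xml.sax.ContentHandler):
    """Keeps only the stack of elements that have not been closed yet."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Push the element; the stack size is the current nesting depth.
        self.open_elements.append(name)
        self.max_depth = max(self.max_depth, len(self.open_elements))

    def endElement(self, name):
        # Pop on close; everything about this element is now discarded.
        self.open_elements.pop()


handler = DepthHandler()
xml.sax.parseString(b"<a><b><c/></b><b/></a>", handler)
print(handler.max_depth)
```

A DOM-style parser given the same input would instead keep every node alive until the document object is released.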

Because of the event-driven nature of SAX, processing a document with a SAX parser is generally far faster than with a DOM-style parser, so long as the processing can be done in a single start-to-end pass. Many tasks, such as indexing, conversion to other formats, and very simple formatting, can be done that way. Other tasks, such as sorting, rearranging sections, getting from a link to its target, or looking up information on one element to help process a later one, require accessing the document structure in complex orders and will be much faster with DOM than with multiple SAX passes.
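Sorting is a task of the second kind. The following is a minimal sketch using Python's xml.dom.minidom, with a made-up document and a hypothetical "title" attribute as the sort key. It relies on the DOM rule that appending a node that already has a parent moves it rather than copying it.

```python
from xml.dom.minidom import parseString

# Hypothetical input; the "title" attribute is the sort key.
doc = parseString(
    '<books><book title="C"/><book title="A"/><book title="B"/></books>'
)
root = doc.documentElement

# Copy the live child list first, since the loop below reorders it.
books = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

# Re-append the children in sorted order; each append moves the node to the end.
for book in sorted(books, key=lambda b: b.getAttribute("title")):
    root.appendChild(book)

titles = [b.getAttribute("title") for b in root.getElementsByTagName("book")]
print(titles)
```

Doing the same with SAX alone would mean buffering all the book elements anyway, which amounts to building part of the tree by hand.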

Some implementations do not neatly fit either category: a DOM approach can keep its persistent data on disk, cleverly organized for speed (editors such as SoftQuad Author/Editor and large-document browser/indexers such as DynaText do this); while a SAX approach can cleverly cache information for later use (any validating SAX parser keeps more information than described above). Such implementations blur the DOM/SAX tradeoffs, but are often very effective in practice.

Because DOM expects every part of the document to be reachable, streamed reading from disk requires techniques such as lazy evaluation, caches, virtual memory, or persistent data structures. Processing XML documents larger than main memory is sometimes thought impossible because some DOM parsers do not allow it. However, it is no less possible than sorting a dataset larger than main memory, using disk space as memory to sidestep the limitation.

Drawbacks

The event-driven model of SAX is useful for XML parsing, but it does have certain drawbacks. Virtually any kind of XML validation requires access to the document in full. The most trivial example is that an attribute declared in the DTD to be of type IDREF requires that there be an element in the document that uses the same value for an ID attribute. To validate this in a SAX parser, one must keep track of all ID attributes (any one of them might end up being referenced by an IDREF attribute at the very end), as well as every IDREF attribute until it is resolved. Similarly, to validate that each element has an acceptable sequence of child elements, information about which child elements have been seen for each parent must be kept until the parent closes.
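The ID/IDREF bookkeeping described above might look roughly like the following sketch, written with Python's xml.sax. It is not a real validator. It assumes, purely for illustration, that attributes literally named "id" and "ref" play the ID and IDREF roles, whereas a real validator would take the attribute types from the DTD. IdRefHandler is a hypothetical name.

```python
import xml.sax


class IdRefHandler(xml.sax.ContentHandler):
    """Collects ID values and references during a single SAX pass."""

    def __init__(self):
        super().__init__()
        self.seen_ids = set()      # every ID seen so far must be remembered
        self.pending_refs = set()  # references that may resolve later

    def startElement(self, name, attrs):
        # Assumption: "id" and "ref" stand in for DTD-declared ID/IDREF types.
        if "id" in attrs:
            self.seen_ids.add(attrs["id"])
        if "ref" in attrs:
            self.pending_refs.add(attrs["ref"])

    def unresolved(self):
        # Only meaningful once the whole document has been read.
        return self.pending_refs - self.seen_ids


handler = IdRefHandler()
xml.sax.parseString(
    b'<doc><note ref="n2"/><item id="n1"/><item id="n2"/><note ref="n9"/></doc>',
    handler,
)
print(handler.unresolved())
```

Note that the handler cannot report an unresolved reference until the document ends, because a matching ID could still appear in the very last element.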

Additionally, some kinds of XML processing simply require having access to the entire document. XSLT and XPath, for example, need to be able to access any node at any time in the parsed XML tree. Editors and browsers likewise need to be able to display, modify, and perhaps re-validate at any time. While a SAX parser may well be used to construct such a tree initially, SAX provides no help for such processing as a whole.
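To illustrate why path queries want the whole tree, here is a small sketch using Python's xml.etree.ElementTree. This module is an in-memory tree API rather than the W3C DOM, and it supports only a limited subset of XPath. The document and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, parsed fully into memory before any query runs.
root = ET.fromstring(
    "<library>"
    "<book year='2001'><title>A</title></book>"
    "<book year='1999'><title>B</title></book>"
    "</library>"
)

# A path expression with a predicate can select nodes anywhere in the tree.
titles_2001 = [b.findtext("title") for b in root.findall("./book[@year='2001']")]

# The same tree can be queried again, in a different order, without re-parsing.
all_titles = [t.text for t in root.iter("title")]
print(titles_2001, all_titles)
```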

XML processing with SAX

A parser that implements SAX (i.e., a SAX Parser) functions as a stream parser, with an event-driven API. The user defines a number of callback methods that will be called when events occur during parsing. The SAX events include (among others):

· XML Text nodes

· XML Element Starts and Ends

· XML Processing Instructions

· XML Comments

Some events correspond to XML objects that are easily returned all at once, such as comments. However, XML elements can contain many other XML objects, and so SAX represents them as does XML itself: by one event at the beginning, and another at the end. Properly speaking, the SAX interface does not deal in elements, but in events that largely correspond to tags. SAX parsing is unidirectional; previously parsed data cannot be re-read without starting the parsing operation again.

There are many SAX-like implementations in existence. In practice, details vary, but the overall model is the same. For example, XML attributes are typically provided as name and value arguments passed to element events, but can also be provided as separate events, or via a hash or similar collection of all the attributes. As another example, some implementations provide "Init" and "Fin" callbacks for the very start and end of parsing, while others do not. The exact names for given event types also vary slightly between implementations.
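As one concrete instance of this model, the sketch below uses Python's xml.sax, where the callbacks are methods on a ContentHandler subclass and attributes arrive as a collection passed to the start-element event. EventLogger and the sample document are hypothetical. Comment events are left out because in this implementation they go through a separate lexical handler interface.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event as a tuple, in the order it is reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        # Plays the role of an "Init" callback in this implementation.
        self.events.append(("start-document",))

    def endDocument(self):
        # Plays the role of a "Fin" callback.
        self.events.append(("end-document",))

    def startElement(self, name, attrs):
        # Attributes arrive as a mapping-like collection on the start event.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Text may be delivered in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append(("text", content))

    def processingInstruction(self, target, data):
        self.events.append(("pi", target, data))


logger = EventLogger()
xml.sax.parseString(b'<greeting lang="en"><?note check?>Hello</greeting>', logger)
for event in logger.events:
    print(event)
```

An element shows up as two separate events, one for its start-tag and one for its end-tag, with whatever it contains reported in between.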

XML and its applications

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification produced by the W3C and in several other related specifications, all of them free open standards.

The design goals of XML emphasize simplicity, generality, and usability over the Internet. It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

Many application programming interfaces (APIs) have been developed to aid software developers with processing XML data, and several schema systems exist to aid in the definition of XML-based languages.

As of 2009, hundreds of document formats using XML syntax have been developed, including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for many office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org and LibreOffice (OpenDocument), and Apple's iWork. XML has also been employed as the base language for communication protocols, such as XMPP.

Well-formedness and error-handling

The XML specification defines an XML document as a well-formed text, meaning that it satisfies a list of syntax rules provided in the specification. Some key points in the fairly lengthy list include:

· The document contains only properly encoded legal Unicode characters

· None of the special syntax characters such as < and & appear except when performing their markup-delineation roles

· The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping

· The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.

· A single "root" element contains all the other elements

The definition of an XML document excludes texts that contain violations of well-formedness rules; they are simply not XML. An XML processor that encounters such a violation is required to report such errors and to cease normal processing. This policy, occasionally referred to as draconian, stands in notable contrast to the behavior of programs that process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors.[14] XML's policy in this area has been criticized as a violation of Postel's law ("Be conservative in what you send; be liberal in what you accept").[15]
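The "report and stop" behavior can be observed with any conforming parser. The sketch below uses Python's xml.sax, which raises SAXParseException on a well-formedness violation under its default error handler. The helper name is_well_formed and the sample inputs are made up for this example.

```python
import xml.sax


def is_well_formed(text: bytes) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        # A do-nothing handler: only the parser's error reporting matters here.
        xml.sax.parseString(text, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        # The parser reports the fatal error and stops normal processing.
        return False
    return True


samples = [
    b"<a><b/></a>",      # correctly nested
    b"<a><b></a></b>",   # overlapping elements
    b"<a/><b/>",         # more than one root element
    b"<a>x & y</a>",     # bare ampersand outside its markup role
]
results = [is_well_formed(s) for s in samples]
print(results)
```

Only the first sample passes. Each of the others breaks one of the rules listed above, and the parser makes no attempt to recover.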

The XML specification defines a valid XML document as a well-formed XML document which also conforms to the rules of a Document Type Definition (DTD). By extension, the term can also refer to documents that conform to rules in other schema languages, such as XML Schema (XSD). This term should not be confused with a well-formed XML document, which is defined as an XML document that has correct XML syntax according to W3C standards.

Schemas and validation

In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.

XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor that discovers a validity error must be able to report it, but may continue normal processing. A DTD is an example of a schema or grammar. Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and the allowable parent/child relationships.

Web Technology

Monday, 29 July 2013
