Document Object Model
The Document
Object Model (DOM) is a cross-platform and language-independent convention for
representing and interacting with objects in HTML, XHTML and XML documents.
Objects in the DOM tree may be addressed and manipulated by using methods on
the objects. The public interface of a DOM is specified in its application
programming interface (API). The history of the Document Object Model is
intertwined with the history of the "browser wars" of the late 1990s
between Netscape Navigator and Microsoft Internet Explorer, as well as with
that of JavaScript and JScript, the first scripting languages to be widely
implemented in the layout engines of web browsers.
History
Legacy DOM
JavaScript was
released by Netscape Communications in 1996 within Netscape Navigator 2.0.
Netscape's competitor, Microsoft, released Internet Explorer 3.0 later the same
year with a port of JavaScript called JScript. JavaScript and JScript let web
developers create web pages with client-side interactivity. The limited
facilities for detecting user-generated events and modifying the HTML document
in the first generation of these languages eventually became known as "DOM
Level 0" or "Legacy DOM." No independent standard was developed
for DOM Level 0, but it was partly described in the specification of HTML 4.
Legacy DOM was
limited in the kinds of elements that could be accessed. Form, link and image
elements could be referenced with a hierarchical name that began with the root
document object. A hierarchical name could make use of either the names or the
sequential index of the traversed elements. For example, a form input element
could be accessed as either "document.formName.inputName" or
"document.forms[0].elements[0]."
The Legacy DOM
enabled client-side form validation and the popular "rollover"
effect.
Intermediate DOM
In 1997,
Netscape and Microsoft released version 4.0 of Netscape Navigator and Internet
Explorer respectively, adding support for Dynamic HTML (DHTML), functionality
enabling changes to a loaded HTML document. DHTML required extensions to the
rudimentary document object that was available in the Legacy DOM
implementations. Although the Legacy DOM implementations were largely
compatible since JScript was based on JavaScript, the DHTML DOM extensions were
developed in parallel by each browser maker and remained incompatible. These
versions of the DOM became known as the "Intermediate DOM."
Standardization
The World Wide
Web Consortium (W3C), founded in 1994 to promote open standards for the World
Wide Web, brought Netscape Communications and Microsoft together with other
companies to develop a standard for browser scripting languages, called
"ECMAScript." The first version of the standard was published in
1997. Subsequent releases of JavaScript and JScript would implement the
ECMAScript standard for greater cross-browser compatibility.
After the
release of ECMAScript, W3C began work on a standardized DOM. The initial DOM
standard, known as "DOM Level 1," was recommended by W3C in late
1998. About the same time, Internet Explorer 5.0 shipped with limited support
for DOM Level 1. DOM Level 1 provided a complete model for an entire HTML or
XML document, including means to change any portion of the document.
Non-conformant browsers such as Internet Explorer 4.x and Netscape 4.x were
still widely used as late as 2000. DOM Level 2 was published in late 2000. It
introduced the "getElementById" function as well as an event model
and support for XML namespaces and CSS. DOM Level 3, the current release of the
DOM specification, published in April 2004, added support for XPath and
keyboard event handling, as well as an interface for serializing documents as
XML.
DOM Level 4 is
currently being developed. Draft version 6 was released in December 2012.
By 2005, large
parts of W3C DOM were well-supported by common ECMAScript-enabled browsers,
including Microsoft Internet Explorer version 6 (from 2001), Opera, Safari and
Gecko-based browsers (like Mozilla, Firefox, SeaMonkey and Camino).
Applications
Web browsers
To render a
document such as an HTML page, most web browsers use an internal model similar
to the DOM. The nodes of every document are organized in a tree structure,
called the DOM tree, with the topmost node named the "Document object". When
an HTML page is rendered in a browser, the browser downloads the HTML into local
memory and automatically parses it to display the page on screen. The DOM is
also the interface through which JavaScript reads and modifies the state of an HTML page in the browser.
Implementations
Because DOM
supports navigation in any direction (e.g., parent and previous sibling) and allows
for arbitrary modifications, an implementation must at least buffer the
document that has been read so far (or some parsed form of it).
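This any-direction navigation can be sketched with Python's built-in xml.dom.minidom, which buffers the whole parsed document in memory; the document and element names below are invented for illustration:

```python
from xml.dom import minidom

# Parse a small document entirely into memory (names are invented examples).
doc = minidom.parseString(
    "<library><book>A</book><book>B</book><book>C</book></library>"
)

books = doc.getElementsByTagName("book")
second = books[1]

# Navigate in any direction: up to the parent, back to the previous
# sibling, forward to the next sibling.
print(second.parentNode.tagName)               # library
print(second.previousSibling.firstChild.data)  # A
print(second.nextSibling.firstChild.data)      # C
```

Because every node keeps references to its parent and siblings, none of the parsed tree can be discarded while the document is in use.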
Layout engines
Web browsers
rely on layout engines to parse HTML into a DOM. Some layout engines such as
Trident/MSHTML and Presto are associated primarily or exclusively with a
particular browser such as Internet Explorer and Opera respectively. Others,
such as WebKit and Gecko, are shared by a number of browsers, such as Google
Chrome, Firefox and Safari. The different layout engines implement the DOM
standards to varying degrees of compliance.
Simple API for XML
SAX (Simple API
for XML) is an event-based sequential access parser API developed by the
XML-DEV mailing list for XML documents. SAX provides a mechanism for reading
data from an XML document that is an alternative to that provided by the
Document Object Model (DOM). Where the DOM operates on the document as a whole,
SAX parsers operate on each piece of the XML document sequentially.
Definition
Unlike DOM,
there is no formal specification for SAX. The Java implementation of SAX is
considered to be normative. SAX processes documents state-dependently, in
contrast to DOM which is used for state-independent processing of XML documents.
Benefits
SAX parsers have
some benefits over DOM-style parsers. A SAX parser only needs to report each
parsing event as it happens, and normally discards almost all of that
information once reported (it does, however, keep some things, for example a list
of all elements that have not been closed yet, in order to catch later errors
such as end-tags in the wrong order). Thus, the minimum memory required for a
SAX parser is proportional to the maximum depth of the XML file (i.e., of the
XML tree) and the maximum data involved in a single XML event (such as the name
and attributes of a single start-tag, or the content of a processing
instruction, etc.).
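A rough sketch of this property, using Python's built-in xml.sax on an invented document: the handler below retains only the stack of currently open element names, so its memory use grows with the tree's depth rather than with the document's total length.

```python
import xml.sax

class DepthTracker(xml.sax.ContentHandler):
    """Keep only the stack of currently open elements -- O(depth) memory."""
    def __init__(self):
        super().__init__()
        self.open_elements = []   # the only state retained between events
        self.max_depth = 0

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.max_depth = max(self.max_depth, len(self.open_elements))

    def endElement(self, name):
        self.open_elements.pop()

handler = DepthTracker()
xml.sax.parseString(b"<a><b><c/></b><b/></a>", handler)
print(handler.max_depth)   # 3
```

However long the document, only the chain of unclosed ancestors is ever held at once.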
This much memory
is usually considered negligible. A DOM parser, in contrast, typically builds a
tree representation of the entire document in memory to begin with, thus using
memory that increases with the entire document length. This takes considerable
time and space for large documents (memory allocation and data-structure
construction take time). The compensating advantage, of course, is that once
loaded any part of the document can be accessed in any order.
Because of the
event-driven nature of SAX, processing documents is generally far faster than
DOM-style parsers, so long as the processing can be done in a start-to-end
pass. Many tasks, such as indexing, conversion to other formats, very simple
formatting, and the like, can be done that way. Other tasks, such as sorting,
rearranging sections, getting from a link to its target, looking up information
on one element to help process a later one, and the like, require accessing the
document structure in complex orders and will be much faster with DOM than with
multiple SAX passes.
Some
implementations do not neatly fit either category: a DOM approach can keep its
persistent data on disk, cleverly organized for speed (editors such as SoftQuad
Author/Editor and large-document browser/indexers such as DynaText do this);
while a SAX approach can cleverly cache information for later use (any
validating SAX parser keeps more information than described above). Such
implementations blur the DOM/SAX tradeoffs, but are often very effective in
practice.
Due to the nature of DOM, streamed reading from disk requires techniques such
as lazy evaluation, caches, virtual memory, or persistent data structures.
Processing XML documents larger than main memory is sometimes thought
impossible because some DOM parsers do not allow it. However, it is possible,
just as a dataset larger than main memory can be sorted by using disk space as
auxiliary storage.
Drawbacks
The event-driven model of SAX is useful for XML parsing, but it does have
certain drawbacks. Virtually any kind of XML validation requires access to the
document in full. The most trivial example is that an attribute declared in the
DTD to be of type IDREF requires that there be an element in the document that
uses the same value for an ID attribute. To validate this in a SAX parser, one
must keep track of every ID attribute (any one of them might end up being
referenced by an IDREF attribute at the very end), as well as every IDREF
attribute until it is resolved. Similarly, to validate that each element has an
acceptable sequence of child elements, information about what child elements
have been seen for each parent must be kept until the parent closes.
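The ID/IDREF bookkeeping can be sketched in one SAX pass with Python's built-in xml.sax. For simplicity this sketch assumes attributes literally named "id" and "idref" rather than types declared in a DTD, and the document is an invented example:

```python
import xml.sax

class IdRefChecker(xml.sax.ContentHandler):
    """Track every ID and IDREF seen; dangling references are only
    knowable once the whole document has been read."""
    def __init__(self):
        super().__init__()
        self.ids = set()
        self.idrefs = []
        self.dangling = None

    def startElement(self, name, attrs):
        if "id" in attrs:
            self.ids.add(attrs["id"])
        if "idref" in attrs:
            self.idrefs.append(attrs["idref"])

    def endDocument(self):
        # Only at the very end can unresolved references be reported.
        self.dangling = [r for r in self.idrefs if r not in self.ids]

handler = IdRefChecker()
xml.sax.parseString(
    b'<doc><n id="a"/><ref idref="a"/><ref idref="z"/></doc>', handler
)
print(handler.dangling)   # ['z']
```

Note that both sets must be retained for the entire pass, which is exactly the extra state the paragraph above describes.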
Additionally,
some kinds of XML processing simply require having access to the entire
document. XSLT and XPath, for example, need to be able to access any node at
any time in the parsed XML tree. Editors and browsers likewise need to be able
to display, modify, and perhaps re-validate at any time. While a SAX parser may
well be used to construct such a tree initially, SAX provides no help for such
processing as a whole.
XML processing with SAX
A parser that
implements SAX (i.e., a SAX Parser) functions as a stream parser, with an
event-driven API. The user defines a number of callback methods that will be
called when events occur during parsing. The SAX events include (among others):
· XML Text nodes
· XML Element Starts and Ends
· XML Processing Instructions
· XML Comments
Some events
correspond to XML objects that are easily returned all at once, such as
comments. However, XML elements can contain many other XML objects, and so SAX
represents them as does XML itself: by one event at the beginning, and another
at the end. Properly speaking, the SAX interface does not deal in elements, but
in events that largely correspond to tags. SAX parsing is unidirectional; previously
parsed data cannot be re-read without starting the parsing operation again.
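As a minimal illustration of this callback style, the following sketch (again with Python's built-in xml.sax, on an invented document) records the stream of start, end, and text events in the order the parser reports them:

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record parsing events in the order the parser reports them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(b"<p>hi<b>!</b></p>", handler)
print(handler.events)   # start/end/text events in document order
```

Note that an element is never delivered as a single object: the `<b>` element appears only as a "start" event and a later "end" event, mirroring the tags themselves. Also, the SAX contract allows `characters` to be invoked several times for one contiguous run of text.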
There are many
SAX-like implementations in existence. In practice, details vary, but the
overall model is the same. For example, XML attributes are typically provided
as name and value arguments passed to element events, but can also be provided
as separate events, or via a hash or similar collection of all the attributes.
For another, some implementations provide "Init" and "Fin"
callbacks for the very start and end of parsing; others don't. The exact names
for given event types also vary slightly between implementations.
XML and Its Applications
Extensible
Markup Language (XML) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and
machine-readable. It is defined in the XML 1.0 Specification produced by the
W3C, and in several other related specifications, all of them gratis open standards.
The design goals of XML emphasize simplicity, generality, and usability over
the Internet. It is a textual data format with strong support via Unicode for
the languages of the world. Although the design of XML focuses on documents, it
is widely used for the representation of arbitrary data structures, for example
in web services.
Many application
programming interfaces (APIs) have been developed to aid software developers
with processing XML data, and several schema systems exist to aid in the
definition of XML-based languages.
As of 2009,
hundreds of document formats using XML syntax have been developed, including
RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for many
office-productivity tools, including Microsoft Office (Office Open XML),
OpenOffice.org and LibreOffice (OpenDocument), and Apple's iWork. XML has also
been employed as the base language for communication protocols, such as XMPP.
Well-formedness and error-handling
The XML
specification defines an XML document as a well-formed text, meaning that it
satisfies a list of syntax rules provided in the specification. Some key points
in the fairly lengthy list include:
· The document contains only properly encoded legal Unicode characters
· None of the special syntax characters such as < and & appear except when performing their markup-delineation roles
· The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping
· The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.
· A single "root" element contains all the other elements
The definition
of an XML document excludes texts that contain violations of well-formedness
rules; they are simply not XML. An XML processor that encounters such a
violation is required to report such errors and to cease normal processing.
This policy, occasionally referred to as draconian, stands in notable contrast
to the behavior of programs that process HTML, which are designed to produce a
reasonable result even in the presence of severe markup errors.[14] XML's
policy in this area has been criticized as a violation of Postel's law
("Be conservative in what you send; be liberal in what you
accept").[15]
The XML
specification defines a valid XML document as a well-formed XML document which
also conforms to the rules of a Document Type Definition (DTD). By extension,
the term can also refer to documents that conform to rules in other schema
languages, such as XML Schema (XSD). This term should not be confused with a
well-formed XML document, which is defined as an XML document that has correct
XML syntax according to W3C standards.
Schemas and validation
In addition to
being well-formed, an XML document may be valid. This means that it contains a
reference to a Document Type Definition (DTD), and that its elements and
attributes are declared in that DTD and follow the grammatical rules for them
that the DTD specifies.
XML processors
are classified as validating or non-validating depending on whether or not they
check XML documents for validity. A processor that discovers a validity error
must be able to report it, but may continue normal processing. A DTD is an
example of a schema or grammar. Since the initial publication of XML 1.0, there
has been substantial work in the area of schema languages for XML. Such schema
languages typically constrain the set of elements that may be used in a
document, which attributes may be applied to them, the order in which they may
appear, and the allowable parent/child relationships.
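As a small invented illustration, the document below is valid only because every element and attribute it uses is declared in its internal DTD:

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST note lang CDATA #IMPLIED>
]>
<note lang="en">
  <to>Alice</to>
  <body>Remember the meeting.</body>
</note>
```

Removing the to element, reordering it after body, or adding an undeclared attribute would leave this document well-formed but no longer valid.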