IT Blog - Search Engines - Article Categories - Programming

Tools for Natural Language Processing

专心练剑 — Mon, 28 Jan 2008 09:52:00 GMT

Text simplification - Wikipedia, the free encyclopedia: Text simplification is an operation used in natural language processing to modify, enhance, classify or otherwise process an existing corpus of human-readable text in such a way that the grammar and structure of the prose is greatly simplified, while the underlying meaning and information remains the same. Text simplification is an important area of research, because natural human languages ordinarily contain complex compound constructions that are not easily processed through automation.

CoPT, Corpus Processing Tools: CoPT, Corpus Processing Tools, is a set of Java classes intended to assist field linguists, NLP researchers and developers, students and software developers in all corpus-related processing.

Jazzy - Java spell checker API: Jazzy is a Java spell checker based on the algorithms used by aspell.

JLinkGrammarParser: JLinkGrammarParser is a Java port of the CMU link grammar parser, a syntactic parser for English.

jSpellCorrect: It’s a simple statistical spelling corrector.

jTokeniser: jTokeniser is a Java library for tokenising strings into a list of tokens. A variety of possible tokenisers are available, including a very basic whitespace tokeniser, a more flexible StringTokeniser, a couple of regular expression tokenisers, and a tokeniser that utilises Java’s BreakIterator, which provides more complex, locale-dependent tokenisation. More recently, a tokeniser that breaks text into its constituent sentences has been added. All are very simple to use.

Linguistic Tree Constructor: LTC is a free program for building linguistic syntax trees from text.
It lets the user build the tree in a point-and-click fashion.
The program does no analysis on its own — the user is completely free to draw the tree however he or she wishes. However, the program makes sure that the tree is a tree and not some other kind of graph.

MII Medical NLP Toolkit: This is a toolkit for medical natural language processing (NLP). The core engine is general enough to be used in a variety of text processing domains, though the toolkit includes specific support for medical reports and patient de-identification.

nlpFarm: The nlpFarm is a Natural Language Processing (NLP) resource where early research prototypes (Java) can evolve into robust and useful open source. Our farmstead collaborates under the OpenNLP initiative, in order to make NLP software publicly available.

OpenNLP: OpenNLP provides the organizational structure for coordinating several different projects which approach some aspect of Natural Language Processing. OpenNLP also defines a set of Java interfaces and implements some basic infrastructure for NLP components.

Open source natural language tools: Toolkit for implementing question answering systems and machine translation in both controlled languages and natural languages. Includes first order logic inference, parsing and semantic analysis, and APIs and standalone server software. Currently some t

The OpenNLP Grok Library: Grok is a library of natural language processing components, including support for parsing with categorial grammars and various preprocessing tasks such as part-of-speech tagging, sentence detection, and tokenization.

The OpenNLP Leo Project: Leo is a project to provide an architecture for defining XML specifications of grammars for different natural language parsing systems and tools for using that architecture to permit sharing of grammar resources across different systems.

The OpenNLP Maximum Entropy Package: Maximum entropy is a powerful method for constructing statistical models of classification tasks, such as part of speech tagging in Natural Language Processing. Several example applications using maxent can be found in the OpenNLP Grok Library.

Visuwords™ online graphical dictionary - download source code: Download the source code for Visuwords.

Balie: Extraction from Text with Machine Learning and Natural Language Techniques

FerFT: Spectral Analyzer: This software is a multi-purpose power spectral analyzer based on the successive Fourier transformation method. (® UTD) It has been developed with Java (ver.1.5) and works on any OS that implements Java ver.1.5 or later.

Julius Speech Recognition Engine: Julius Speech recognition engine

Modular Audio Recognition Framework: MARF is a general cross-platform framework with a collection of algorithms for audio (voice, speech, and sound) and natural language text analysis and recognition along with sample applications (identification, NLP, etc.) of its use, implemented in Java.

VoxForge 0.0.1: Speech recognition support

OpenCCG: The OpenNLP CCG Library: OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman’s Combinatory Categorial Grammar (CCG) formalism.

Joone: Joone (Java Object Oriented Neural Engine) is an artificial neural network Java framework. It is used to build and train neural networks with a powerful visual environment. It has a modular design and can be easily extended by writing new modules to implement new learning algorithms or architectures.

Tools for Robotics

专心练剑 — Mon, 28 Jan 2008 09:52:00 GMT

CLARAty Software: This site contains information about the CLARAty reusable robotic software framework, videos of the capabilities that were demonstrated on real and simulated robotic platforms, and information on how to download and run the software. CLARAty stands for Coupled-Layer Architecture for Robotic Autonomy. It is a collaborative effort among four institutions: Jet Propulsion Laboratory, NASA Ames Research Center, Carnegie Mellon, and the University of Minnesota.

jHomeNet :: Java Home Automation: jHomeNet is a home automation application written primarily in Java used to monitor and control sensors and devices around your house. The application makes use of a number of existing communication technologies including Dallas Semiconductor’s 1-Wire and X-10 protocols. Administration and control of the software is through a GUI written entirely using the Swing development tools but makes use of a number of third party libraries.

C# machine vision: Sentience is a stereoscopic vision and mapping system for mobile robots. It was developed initially as part of the Rodney humanoid robot project, and has been refined over several years. The system uses cheap low resolution webcam technology to acquire images and calculate a depth map from them.

DP-SLAM robot vision: Welcome to the DP-SLAM web page. DP-SLAM aims to achieve truly simultaneous localization and mapping without landmarks. While DP-SLAM is compatible with techniques that correct maps when a loop is closed, we have found that DP-SLAM is accurate enough that no special loop closing techniques are required in most cases. DP-SLAM makes only a single pass over the sensor data.

javavis: A Computer Vision Library in Java

NeatVision: NeatVision is a free Java based image analysis and software development environment, which provides high level access to a wide range of image processing algorithms through well defined and easy to use graphical interface. NeatVision is in its second major release. New features include: A full developers guide with method listings and programme examples, DICOM and Analyze medical image sequence viewers, URL control, feature fitting, supervised and unsupervised colour clustering, DCT, Improved FFT, 3D volume processing and surface rendering.

Orocos: Open Robot Control Software project. The project’s aim is to develop a general-purpose, free software, and modular framework for robot and machine control. The Orocos project supports 4 C++ libraries: the Real-Time Toolkit, the Kinematics and Dynamics Library, the Bayesian Filtering Library and the Orocos Component Library.

S.O.N.I.A. - Système d’Opération Nautique Intelligent et Autonome: Student club of Ecole de Technologie Superieure who build an autonomous submarine for the AUVSI competition.

The Orocos Project | Smarter control in robotics & automation!: Home for C++ libraries for advanced machine and robot control.

Java Development Tools

专心练剑 — Mon, 28 Jan 2008 09:48:00 GMT

Apache Maven: Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project’s build, reporting and documentation from a central piece of information.
Beanlet - JSE Application Container - Confluence: Inspired by EJB3 and Spring, Beanlet delivers an IoC enabled application container offering the best of both worlds. Beanlet’s programming model looks similar to that of EJB3, but its flexibility is comparable to that of Spring. The Beanlet architecture supports JTA transactions, the Java Persistence API, JNDI, Web integration, and last but not least, the Spring Framework.
Bean Shell: BeanShell is a small, free, embeddable Java source interpreter with object scripting language features, written in Java. BeanShell dynamically executes standard Java syntax and extends it with common scripting conveniences such as loose types, commands, and method closures like those in Perl and JavaScript.
cdrtools 2.01.01a17 (Development): About: cdrtools (formerly cdrecord) creates home-burned CDs/DVDs with a CDR/CDRW/DVD recorder. It works as a burn engine for several applications. It supports CD/DVD recorders from many different vendors; all SCSI-3/mmc- and ATAPI/mmc-compliant drives should also work. Supported features include IDE/ATAPI, parallel port, and SCSI drives, audio CDs, data CDs, and mixed CDs, full multi-session support, CDRWs (rewritable), DVD-R/-RW, DVD+R/+RW, TAO, DAO, RAW, and human-readable error messages. cdrtools includes remote SCSI support and can access local or remote CD/DVD writers.
Coadunation daemon server: Coadunation open source daemon server
Crossroads load balancer: Crossroads is an open source load balancing and failover utility for TCP based services. It is a daemon running in user space, and features extensive configurability, polling of back ends using ‘wakeup calls’, detailed status reporting, ‘hooks’ for special actions when backend calls fail, and much more. Crossroads is service-independent: it is usable for HTTP(S), SSH, SMTP, DNS, etc. In the case of HTTP balancing, Crossroads can provide ‘session stickiness’ for back end processes that need sessions, but aren’t session-aware of other back ends.
freedesktop.org - Software/dbus: D-Bus is a message bus system, a simple way for applications to talk to one another. In addition to interprocess communication, D-Bus helps coordinate process lifecycle; it makes it simple and reliable to code a “single instance” application or daemon, and to launch applications and daemons on demand when their services are needed.
FreeNAS: The Free NAS Server - Home: Free NAS Server
GCViewer: GCViewer is a free open source tool to visualize data produced by the Java VM options -verbose:gc and -Xloggc:. It also calculates garbage collection related performance metrics (throughput, accumulated pauses, longest pause, etc.). This can be very useful when tuning the garbage collection of a particular application by changing generation sizes or setting the initial heap size.
GridGain - Open Source Grid Computing For Java: GridGain Systems provides professional services around our open source Java grid computing framework. We provide enterprise level support, in-depth training and consulting helping our clients to get the most out of our product during initial evaluation, development and production use.
Hadoop Map/Reduce framework: Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Java Parallel Processing Framework: An open-source, Java-based, framework for parallel computing.
Java Print Dialog Framework: The JPDF provides preview and print capabilities to Java applications. Swing components, like JTable and JTextPane, can be previewed and printed. Forms and reports can be composed and printed. A large variety of Page Setup, Preview, and Print dialogs is provided.
javaSVNUpdater: javaSVNUpdater is a Java library that allows an application to update or patch itself automatically. The versioning information about the application needs to be stored in a subversion archive, and committing to the archive effects the distribution of new versions. It includes an updater wizard and an executor to spawn a separate process for proceeding with updates.
Java Units of Measure: This is a Java package with abstract data types for measurable quantities (like volume and speed) and units for measuring them (like liters and furlongs per fortnight). You may convert from one set of units to another, and may add your own quantities and units.
jGCS: The jGCS library provides a generic interface for Group Communication. This interface can be used by applications that need primitives from simple IP Multicast group communication to virtual synchrony or atomic broadcast. It's a common interface to several existing toolkits that provide different APIs.
JGroups (JBoss cluster comm): A Toolkit for Reliable Multicast Communication
Joda Time - Java date and time API: Joda-Time provides a quality replacement for the Java date and time classes. The design allows for multiple calendar systems, while still providing a simple API. The ‘default’ calendar is the ISO8601 standard which is used by XML. The Gregorian, Julian, Buddhist, Coptic, Ethiopic and Islamic systems are also included, and we welcome further additions. Supporting classes include time zone, duration, format and parsing.
JUnitConv: JUnitConv is a free Open Source universal Units of Measure Converter that converts numbers from one unit of measure to another. Built as a Java Applet, JUnitConv is platform-independent and highly configurable. It supports an unlimited number of Units Categories, Units of Measure and Multiplier Prefixes, which can be customized using external text files. You can set up your own data files using your preferred spoken language, units categories, units definitions and multiplier prefixes. The default configuration data files contain 580 basic units of measure definitions divided into 31 categories and 27 multiplier prefixes, for a total of 15660 composed units.
libreplacer: libreplacer is an easy-to-use string formatting library for Java, which provides a C sprintf-like syntax and can be easily extended for all kinds of object-to-string formatting.
Lobo: Java Web Browser: Lobo is an open source pure Java web browser with support for HTML 4, Javascript and CSS2.
Mozilla Java Html Parser: Mozilla Java Html Parser is a Java package that enables you to parse HTML pages into a Java Document object. The parser is a wrapper around Mozilla’s Html Parser, thus giving the user a browser-quality HTML parser.
Mr. Persister: Mr. Persister is a POJO persistence API for Java. The main focus of Mr. Persister is to handle all the trivial JDBC work, and leave the non-trivial parts up to you. It uses plain SQL as the query language, it can auto-map objects to the database, and it can generate SQL for a lot of trivial tasks by itself (such as insert, update, and delete). Mr. Persister also has support for easy batch updates of collections of objects, connection and transaction handling, and many other features.

mubench: mubench is an in-depth, low-level benchmark for x86 processors. Its primary goal is to provide useful information for people who optimize assembly code and for people who write compilers. It measures latency and throughput for each individual instruction (sometimes several forms of the same instruction), as well as the throughput of arbitrary instruction mixes. The results produced by mubench are typically an order of magnitude more detailed than those found in AMD or Intel manuals.

myrpm: Myrpm is a set of utilities that lets you easily turn software into RPM packages. More than a simple set of scripts, it allows you to manage large groups of servers in an elegant and efficient way.

NetBeans HotSpot grapher: Masters Thesis

nlink: NLink - Native Library Linker: Provides a general-purpose method invocation converter driven by annotation. With NLink, calling a native library is as easy as follows, and then the NLink runtime invokes the corresponding method for you

Primrose: Primrose is a database connection pool which supports all databases that have JDBC drivers. It provides control over SQL transaction monitoring, configuration, and dynamic pool management via a Web interface.

pulse 1.1.14: Pulse is an automated build (or continuous integration) server designed to work with you to ensure the integrity of your code. Pulse regularly checks your source code out from your SCM, builds your projects, and notifies you of the results.

recordMyDesktop: recordMyDesktop is a desktop session recorder for Linux that attempts to be easy to use, yet also effective at its primary task. As such, the program is separated into two parts: a simple command line tool that performs the basic tasks of capturing and encoding, and an interface that exposes the program functionality in a usable way.

Redstone Prevalent Storage: Redstone Prevalent Storage is minimalistic prevalent storage for Java SE 5.0 that replaces the need for JDBC and an RDBMS for small and mid-sized applications. The library is comprised of an intentionally small set of concise interfaces and classes, and can be suitable for many types of applications where the data storage does not necessarily require an RDBMS. The library is heavily influenced by Prevayler and the Prevayler team (who should receive all credit).

Sesame: The Sesame 2 RDF store is an open source RDF framework with support for RDF Schema inferencing and querying.

SPARQL Query Language for RDF (Specification): RDF is a directed, labeled graph data format for representing information in the Web. This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be results sets or RDF graphs.

Sun Grid network.com: Sun Web Tier Solutions

TIJmp - Java Memory Profiler - why object not gc’ed: TIJmp is a memory profiler for java. TIJmp is made for java/6 and later, it will not work on java/5 systems. If you need a profiler for java/5 or earlier try the jmp profiler.

Tiny Marbles: Tiny Marbles is a transactional, persistent object repository for dynamic objects. Apart from the initial setup, all the interaction between Tiny Marbles and the application is done programmatically at runtime. All objects can be modified after creation.

Yamon, Yet Another Monitoring script: Yamon is a very simple Perl program designed to check whether a server is up-and-running and send an alert to a human when something appears to be broken.

YourKit Java Profiler 6.0-EAP-build1076: YourKit Java Profiler is a CPU and memory profiler that makes it easy to solve a wide range of CPU- and memory-related performance problems. It features automatic leak detection, powerful tools for the analysis of memory distribution, an object heap browser, comprehensive memory tests as part of your JUnit testing process, extremely low profiling overhead, transparent deobfuscation support, and integration with Eclipse, JBuilder, IntelliJ IDEA, NetBeans, and JDeveloper IDEs.

iCal4j: iCal4j is a Java API incorporating an iCalendar parser, model, validator, and outputter.

Jakarta POI - Java API To Access Microsoft Format Files: The POI project consists of APIs for manipulating various file formats based upon Microsoft’s OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you’ll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.

JavaPlot: Pure Java programming interface library for GNUPlot

Java Software Components by Big Faceless Organization: Java Software Components from Big Faceless Organization including Report Generator, PDF Library and Graph Library

JBDiff: JBDiff (Java Binary Diff) utility is a Java port of the C based bsdiff utility by Colin Percival.

JBoss.com - Wiki - EmbeddedJBoss: The Professional Open Source Company

JBoss Rules: JBoss Rules is the supported and branded release of the Drools project. Drools is an enhanced Rules Engine implementation, ReteOO, based on Charles Forgy’s Rete algorithm tailored for the Java language. More importantly, Drools provides for Declarative Programming and is flexible enough to match the semantics of your problem domain with Domain Specific Languages.

jgcalapi: JGCalAPI provides an easy to use wrapper for the Google Calendaring GData API. This wrapper is intended to hide much of the REST ugliness of the API, thus making it somewhat easier to get started with and to use.

jweather: jweather is a Java library for parsing raw weather data (e.g. METAR, TAF). It currently focuses on parsing and providing an API for access to METAR data.

ngrease metalanguage: The world’s largest development and download repository of Open Source code and applications

Open Data: The Open Data Commons Public Domain Dedication & Licence is a document intended to allow you to freely share, modify, and use this work for any purpose and without any restrictions. This licence is intended for use on databases or their contents ("data"), either together or individually.

Raptor RDF Parser Library: Raptor is a free software / Open Source C library that provides a set of parsers and serializers that generate Resource Description Framework (RDF) triples by parsing syntaxes or serialize the triples into a syntax. The supported parsing syntaxes are RDF/XML, N-Triples, TRiG, Turtle, RSS tag soup including all versions of RSS, Atom 1.0 and 0.3, GRDDL and microformats for HTML, XHTML and XML. The serializing syntaxes are RDF/XML (regular, and abbreviated), N-Triples, RSS 1.0, Atom 1.0 and Adobe XMP.

rest-client - Google Code: RESTClient is a Java platform client application to test RESTful webservices. It can be used to test a variety of HTTP communications.

RIFE : Continuations: Full-stack open-source component framework to quickly and consistently develop and maintain Java web applications

Texai Software

专心练剑 — Mon, 28 Jan 2008 09:46:00 GMT

OpenCyc RDF: This release contains RDF statements extracted from the OpenCyc knowledge base. It omits objects that are not directly compatible with RDF, namely non-atomic terms, rules, non-binary relations. Context is only included in the TriG formatted files.

WordNet 2.1 RDF: This is an extract of the Princeton WordNet version 2.1 lexical knowledge base in RDF format. Only the TriG version contains context.

The CMU Pronouncing Dictionary RDF: This release contains The CMU Pronouncing Dictionary converted to RDF. Only the TriG version contains context.

Wiktionary RDF: This release contains the English entries from the Wiktionary, as of Spring 2007, extracted to RDF. Only the TriG version contains context.

Texai Lexicon RDF: This is the Texai lexicon, which is a merging of WordNet 2.1, the CMU Pronouncing Dictionary, Wiktionary, and the OpenCyc lexicon. Only the TriG version contains context.

RDF Entity Manager: The RDF Entity Manager is the framework for persisting semantically annotated Java classes to the Sesame 2 RDF store.

Texai Lexicon Java Library: This package contains the semantically annotated Java classes for the RDF entities that represent the Texai Lexicon. Included are all the required jar files (e.g. Sesame 2).

Fluid Construction Grammar Java Library: This package is a Java implementation of Fluid Construction Grammar, originally implemented in Lisp by researchers at emergent-languages.org. See this brief tutorial. The Emergent Languages web site has more information about Fluid Construction Grammar and its role in emergent languages research. This Java release will be feature-frozen and new features, namely incremental parsing, will be placed into a new Java release, IncrementalFCG, now under development.

Texai Utilities Java Library: This Java class library provides utilities for the remainder of the Texai project.

Generating XML via Java

专心练剑 — Sat, 26 Jan 2008 02:33:00 GMT

XML developers are used to relying on XML parsers to read XML files, and on XML processors to transform XML into other markup (HTML, XML ...). However, most of them overlook these tools when they generate XML from scratch. They should not.

Below is the XML file (users.xml) you want to generate from the input data. It is just a list of users. Each user has an ID, a TYPE and a NAME (the full definition is available in users.dtd).

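Input data:

NAME	ID	TYPE
Tim@Home	PWD122	customer
Jack&Moud	MX787	manager
John D'oé	A4Q45	employee

Matching valid XML file (users.xml):

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE USERS SYSTEM "users.dtd">
<USERS>
  <USER ID="PWD122" TYPE="customer">Tim@Home</USER>
  <USER ID="MX787" TYPE="manager">Jack&amp;Moud</USER>
  <USER ID="A4Q45" TYPE="employee">John D'oé</USER>
</USERS>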

A developer new to XML might write the following code to quickly generate users.xml:

(1) - Serialization to file output stream -

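[...]
String ENCODING = "ISO-8859-1";
String[] id = {"PWD122","MX787","A4Q45"};
String[] type = {"customer","manager","employee"};
String[] desc = {"Tim@Home","Jack&Moud","John D'oé"};
PrintWriter out = new PrintWriter(new FileOutputStream("users.xml"));
// XML declaration, DOCTYPE and root element written as raw strings.
out.println("<?xml version=\"1.0\" encoding=\""+ENCODING+"\"?>");
out.println("<!DOCTYPE USERS SYSTEM \"users.dtd\">");
out.println("<USERS>");
for (int i=0;i<id.length;i++) {
  // Naive on purpose: desc[i] is written as-is, so '&' is not escaped.
  out.println("<USER ID=\""+id[i]+"\" TYPE=\""+type[i]+"\">"+desc[i]+"</USER>");
}
out.println("</USERS>");
[...]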

Unfortunately, this code does not generate valid XML. Look at the "Jack&Moud" user name: it has to be written as "Jack&amp;Moud", because '&' is a special character that must be replaced by its predefined entity. In addition to these special characters, the developer has to take into account XML processing instructions (for example <?xml-stylesheet type="text/xsl" href="users.xsl"?>), entity references, comments and CDATA sections (an XML syntax reference is available in PDF format here).
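
The escaping problem described above can be made concrete with a small helper. This is only a minimal sketch, not part of the original samples: the class name XmlEscape and its escape method are hypothetical, and it covers nothing beyond the five predefined XML entities (no processing instructions, comments, CDATA sections or encoding issues), which is exactly why the library-based approaches that follow are preferable.

```java
// Hypothetical helper for illustration only: replaces the five characters
// that have predefined entities in XML. It does not handle anything else.
final class XmlEscape {
    private XmlEscape() {}

    /** Returns text with &, <, >, " and ' replaced by predefined entity references. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Jack&Moud")); // prints Jack&amp;Moud
    }
}
```

Wrapping desc[i] in such a call would repair sample (1) for this particular input, but every new special case would have to be handled by hand as well.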

Obviously, Java XML libraries (parsers and/or processors) should be used to generate well-formed and valid XML. The first solution creates a DOM document from scratch and serializes it to XML. The following code uses Xerces to do so:

(2) - DOM + Xerces serialization to file output stream -

import java.io.*;
// DOM
import org.w3c.dom.*;
// Xerces classes.
import org.apache.xerces.dom.DocumentImpl;
import org.apache.xml.serialize.*;
[...]
Element e = null;
Node n = null;
// Document (Xerces implementation only).
Document xmldoc= new DocumentImpl();
// Root element.
Element root = xmldoc.createElement("USERS");
String[] id = {"PWD122","MX787","A4Q45"};
String[] type = {"customer","manager","employee"};
String[] desc = {"Tim@Home","Jack&Moud","John D'oé"};
for (int i=0;i<id.length;i++) {
  // Child i.
  e = xmldoc.createElementNS(null, "USER");
  e.setAttributeNS(null, "ID", id[i]);
  e.setAttributeNS(null, "TYPE", type[i]);
  n = xmldoc.createTextNode(desc[i]);
  e.appendChild(n);
  root.appendChild(e);
}
xmldoc.appendChild(root);
FileOutputStream fos = new FileOutputStream(filename);
// XERCES 1 or 2 additional classes.
OutputFormat of = new OutputFormat("XML","ISO-8859-1",true);
of.setIndent(1);
of.setIndenting(true);
of.setDoctype(null,"users.dtd");
XMLSerializer serializer = new XMLSerializer(fos,of);
// As a DOM Serializer
serializer.asDOMSerializer();
serializer.serialize( xmldoc.getDocumentElement() );
fos.close();
[...]

This solution works well for small XML files, but it should be avoided for big files because it consumes a lot of memory: the full DOM document (Element, Node, Attributes ...) is held in memory before being serialized. This is the major drawback of DOM.

A memory-friendly solution is to serialize the XML file on the fly through SAX.
The code below uses Xerces to do so :

(3) - SAX + Xerces serialization to file output stream -

import java.io.*;
// Xerces 1 or 2 additional classes.
import org.apache.xml.serialize.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
[...]
FileOutputStream fos = new FileOutputStream(filename);
// XERCES 1 or 2 additional classes.
OutputFormat of = new OutputFormat("XML","ISO-8859-1",true);
of.setIndent(1);
of.setIndenting(true);
of.setDoctype(null,"users.dtd");
XMLSerializer serializer = new XMLSerializer(fos,of);
// SAX2.0 ContentHandler.
ContentHandler hd = serializer.asContentHandler();
hd.startDocument();
// Processing instruction sample.
//hd.processingInstruction("xml-stylesheet","type=\"text/xsl\" href=\"users.xsl\"");
// USER attributes.
AttributesImpl atts = new AttributesImpl();
// USERS tag.
hd.startElement("","","USERS",atts);
// USER tags.
String[] id = {"PWD122","MX787","A4Q45"};
String[] type = {"customer","manager","employee"};
String[] desc = {"Tim@Home","Jack&Moud","John D'oé"};
for (int i=0;i<id.length;i++) {
  atts.clear();
  atts.addAttribute("","","ID","CDATA",id[i]);
  atts.addAttribute("","","TYPE","CDATA",type[i]);
  hd.startElement("","","USER",atts);
  hd.characters(desc[i].toCharArray(),0,desc[i].length());
  hd.endElement("","","USER");
}
hd.endElement("","","USERS");
hd.endDocument();
fos.close();
[...]

SAX is an event-based API. To read an XML file, you implement the ContentHandler interface; the parser then calls its startElement(...), characters(...) and endElement(...) methods as it parses the file. Xerces provides the reverse: the developer calls the startElement(...), characters(...) and endElement(...) SAX methods to generate the XML file.
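
For comparison, the reading direction mentioned above can be sketched as follows. This is an illustrative example rather than code from the original article: the class name UserNamesHandler and the parseNames method are hypothetical, and the parser is obtained through the JDK's SAXParserFactory instead of calling Xerces directly. It collects the text of each USER element from the same events that the generation samples emit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: gathers the text content of USER elements.
final class UserNamesHandler extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a USER element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("USER".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length); // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("USER".equals(qName)) {
            names.add(current.toString());
            current = null;
        }
    }

    /** Parses the given XML text and returns the USER names in document order. */
    static List<String> parseNames(String xml) {
        try {
            UserNamesHandler handler = new UserNamesHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<USERS><USER ID=\"MX787\" TYPE=\"manager\">Jack&amp;Moud</USER></USERS>";
        System.out.println(parseNames(xml)); // prints [Jack&Moud]
    }
}
```

Note that the parser hands back the unescaped name, which is the mirror image of the escaping a serializer performs when characters(...) is called for output.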

The samples above need the Xerces library, but what about other libraries such as Crimson, JDOM, Xalan2 or JDK 1.4? How do you select an XML parser and/or an XML processor? How do you keep Java code independent of a specific library?
A good solution is JAXP 1.1, an API developed by Sun. It lets you plug in any XML parser and/or processor and "write once, run anywhere" your Java/XML code. An implementation of this API is available in JDK 1.4 and is provided by Xalan2 too.
Here are similar samples for XML generation with JAXP 1.1:

(4) - JAXP + DOM + Serialization to servlet output stream : JDK 1.4 compliant -

import java.io.*;
// DOM classes.
import org.w3c.dom.*;
//JAXP 1.1
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import javax.xml.transform.dom.*;
[...]
// PrintWriter from a Servlet
PrintWriter out = response.getWriter();
// Create XML DOM document (Memory consuming).
Document xmldoc = null;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DOMImplementation impl = builder.getDOMImplementation();
Element e = null;
Node n = null;
// Document.
xmldoc = impl.createDocument(null, "USERS", null);
// Root element.
Element root = xmldoc.getDocumentElement();
String[] id = {"PWD122","MX787","A4Q45"};
String[] type = {"customer","manager","employee"};
String[] desc = {"Tim@Home","Jack&Moud","John D'oé"};
for (int i=0;i<id.length;i++) {
  // Child i.
  e = xmldoc.createElementNS(null, "USER");
  e.setAttributeNS(null, "ID", id[i]);
  e.setAttributeNS(null, "TYPE", type[i]);
  n = xmldoc.createTextNode(desc[i]);
  e.appendChild(n);
  root.appendChild(e);
}
// Serialisation through Transform.
DOMSource domSource = new DOMSource(xmldoc);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING,"ISO-8859-1");
serializer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,"users.dtd");
serializer.setOutputProperty(OutputKeys.INDENT,"yes");
serializer.transform(domSource, streamResult);
[...]

(5) - JAXP + SAX + Serialization to servlet output stream : JDK 1.4 compliant -

import java.io.*;
// SAX classes.
import org.xml.sax.*;
import org.xml.sax.helpers.*;
//JAXP 1.1
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import javax.xml.transform.sax.*;
[...]
// PrintWriter from a Servlet
PrintWriter out = response.getWriter();
StreamResult streamResult = new StreamResult(out);
SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
// SAX2.0 ContentHandler.
TransformerHandler hd = tf.newTransformerHandler();
Transformer serializer = hd.getTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING,"ISO-8859-1");
serializer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,"users.dtd");
serializer.setOutputProperty(OutputKeys.INDENT,"yes");
hd.setResult(streamResult);
hd.startDocument();
AttributesImpl atts = new AttributesImpl();
// USERS tag.
hd.startElement("","","USERS",atts);
// USER tags.
String[] id = {"PWD122","MX787","A4Q45"};
String[] type = {"customer","manager","employee"};
String[] desc = {"Tim@Home","Jack&Moud","John D'oé"};
for (int i = 0; i < id.length; i++) {
  atts.clear();
  atts.addAttribute("","","ID","CDATA",id[i]);
  atts.addAttribute("","","TYPE","CDATA",type[i]);
  hd.startElement("","","USER",atts);
  hd.characters(desc[i].toCharArray(),0,desc[i].length());
  hd.endElement("","","USER");
}
hd.endElement("","","USERS");
hd.endDocument();
[...]

This last sample is probably the best solution: because it uses JAXP 1.1, it works under JDK 1.4, or under JDK 1.2/1.3 with the Xalan2 library (or any JAXP 1.1 compliant XML library). It is also memory-friendly because it does not build a DOM.
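The SAX-based approach can also be tried outside a servlet. The following is a minimal sketch under stated assumptions: a StringWriter stands in for the servlet's PrintWriter, the XML declaration is omitted to keep the output short, and the names SaxXmlWriterDemo and buildUsersXml are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical wrapper around the JAXP + SAX generation idiom, writing to a String.
class SaxXmlWriterDemo {
    static String buildUsersXml(String[] id, String[] desc) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler hd = tf.newTransformerHandler();
        // Output properties are set before the result is attached, as in the sample above.
        hd.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        hd.setResult(new StreamResult(out));
        hd.startDocument();
        AttributesImpl atts = new AttributesImpl();
        hd.startElement("", "", "USERS", atts);
        for (int i = 0; i < id.length; i++) {
            atts.clear();
            atts.addAttribute("", "", "ID", "CDATA", id[i]);
            hd.startElement("", "", "USER", atts);
            // The serializer, not the caller, is responsible for escaping & and <.
            hd.characters(desc[i].toCharArray(), 0, desc[i].length());
            hd.endElement("", "", "USER");
        }
        hd.endElement("", "", "USERS");
        hd.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUsersXml(new String[] {"MX787"}, new String[] {"Jack&Moud"}));
    }
}
```

The point of the exercise is that the ampersand in "Jack&Moud" is escaped by the serializer, which is what keeps the generated document well-formed without any manual string handling.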

If you are still not convinced that you should use JAXP, think about servlet engines (Tomcat, Resin, ...) and application servers (WebSphere, WebLogic, ...). Most of them already include an XML parser/processor, and trying to add another one can cause sealing errors. In conclusion, generating well-formed XML is not so easy for a novice programmer. A smart, standard, portable, memory-friendly solution should be based on JAXP 1.1 + SAX.
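Because a container may already bundle its own parser and processor, it can help to check which implementations JAXP actually resolves at run time. This is a small sketch, not part of the original samples; the class name JaxpProviders is hypothetical, and the class names it prints depend entirely on the JDK or container in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic: reports the concrete classes behind the JAXP factories.
class JaxpProviders {
    static String[] providers() {
        return new String[] {
            DocumentBuilderFactory.newInstance().getClass().getName(),
            SAXParserFactory.newInstance().getClass().getName(),
            TransformerFactory.newInstance().getClass().getName()
        };
    }

    public static void main(String[] args) {
        for (String name : providers()) {
            System.out.println(name);
        }
    }
}
```

Code written against the factories keeps working whichever implementation is listed, which is the portability argument made above.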

Posted by 专心练剑, 2008-01-26 10:33

Software Tools for NLP

专心练剑 — Thu, 17 Jan 2008 08:52:00 GMT

Software Archive

General Information

Sourcebank - a search engine for programming resources.
Resources related to content analysis and text analysis - Software
Some publicly available NLP packages
SAL (Scientific Applications on Linux)
Public Domain Generic Tools: An Overview - a paper written by Tomaz Erjavec
A collection of online interactive CL tools (Computational Linguistics Group, University of Zurich)
The LINGUIST List: Software
The Natural Language Software Registry
Language Software Helpdesk
- Frequently Asked Questions
PennTools - Computational Linguistics Resources At Penn.
Parsing Resources
Taggers online, email message containing addresses
Parsers and Taggers Information (by Steven Paul Abney)
Relator Language Processing Resources
Corpus Search Tools
Neural Networks & Statistics: Software

Tagger, Morphological Analyzer

A Perl/Tk text tagger
Conexor
Cogilex R&D inc - Makers of expert tools for natural language processing
CLAWS part-of-speech tagger
TnT - Statistical Part-of-Speech Tagging
POS tagger for Spanish
Tagging and Parsing tools
AUTASYS - A Fully Automatic English Wordclass Analysis System
TOSCA/LOB tagger
Relaxation Labelling Based Multi-Tagger
The QTAG Part of Speech Tagger
QTAG: A portable Parts of Speech Tagger
The Alvey Natural Language Tools
The XTAG Project
TreeTagger - a language independent part-of-speech tagger
Xerox Part-of-Speech Tagger
The Edinburgh/Cambridge Morphological Analyser System
Winbrill - An adaptation of Brill's tagger to Windows 95/98.
Eric Brill's Part of Speech Tagger
Software Plaza: Brill's Tagger
Morphy - An integrated tool for German morphology and statistical part-of-speech tagging.
Korean Morphological Analyzer
Natural Language Tools - Japanese morphological analyzer (JUMAN) and parser (KNP) developed by Nagao Lab. at Kyoto University, Japan.
WordSmith Tools - Wordsmith Tools is the Swiss Army knife of lexical analysis - an integrated suite of programs for looking at how words behave in texts. It is intended for linguists, language teachers, and anyone who needs to examine language.
- Mike Scott's Home Page
- Oxford University Press
A Lexical Analyzer for HTML and Basic SGML
ARIES Natural Language Tools - Lexical platform for the Spanish language.

Stemmer

Collocation

Xtract - Frank Smadja's Collocation Extractor.

Parser

Malaga - a system for automatic language analysis
Attribute-Logic Engine (ALE) System and Grammars - A freeware logic programming and grammar parsing system.
CG Parser - Natural deduction categorial grammar and lambda-calculus parser.
Head-Corner Parser (by Gertjan van Noord)
A basic parser written to illustrate the bottom up parsing algorithms in Natural Language Understanding, Second Edition
Cass Partial Parser
CHILL: An empirical parser acquisition system using inductive logic programming
ISSCO Tools - Left-head-corner Island Parser Compiler, etc.
Georgetown University Natural Language Processing
Parser Modularity Demo page
PC-PATR: A syntactic parser
IMS Stuttgart: The CUF Web Page - Comprehensive Unification Formalism
Apple Pie Parser - The Apple Pie Parser is a bottom-up probabilistic chart parser which finds the parse tree with the best score by best-first search algorithm.
Link Grammar Parser

Corpus Tools

WebCorp
Concordances: Producing and Using them
XCES: Corpus Encoding Standard for XML
RST Tool - An RST (Rhetorical Structure Theory) Markup Tool.
RST Annotation Tool
Qwick - corpus browser
Linguistic Annotation - This page describes tools and formats for creating and managing linguistic annotations.
Alembic Workbench - a suite of tools for the analysis of a corpus, along with the Alembic system to enable the automatic acquisition of domain-specific tagging heuristics.
The System Quirk - Workbench for Terminology, Lexicography and Text Analysis.
Multext: Multilingual Text Tools and Corpora
XCorpus - An Environment for Managing Corpus and Multilingual Web Server
The IMS Corpus Toolbox Webpage
Kobe Phoenix Laboratory - Corpus Wizard program.
Concordance - A program for Windows NT 4.0 and Windows 95/98 which makes wordlists, concordances, and Web Concordances from your electronic texts.
MonoConc (concordance program)
MonoConc for Windows (concordance program)
Text Analysis Computing Tools (TACT)
The Lingua Project: The World of MultiLingual Parallel Concordancing
(http://prune.loria.fr/~bonhomme/lingua/)
- Sentences alignment tool in multilingual corpora.
The Lingua Project: The World of MultiLingual Parallel Concordancing
(http://www.loria.fr/exterieur/equipe/dialogue/lingua/)
Textual Corpora and Tools for their Exploration

Language Modeling

HMM

A HMM mini-toolkit (by Anand Venkataraman)
HMM Software
see also: Exercise: Using a Hidden Markov Model
Discrete HMM Toolkit
Hidden Markov Model (HMM) Toolbox
Meta-MEME: Motif-based Hidden Markov Models of Biological Sequences

Language Identification

FSA Tools

Finite State Utilities
Automata Learning from Theory to Practice
- Downloadable Software
Index to finite-state machine software, products, and projects
FSA utilities
- FSA Utilities: A Toolbox to Manipulate Finite-state Automata
Grail - a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.
AMoRE - A program for the computation of Automata, Monoids, and Regular Expressions.

Speech

HTK: Hidden Markov Model Toolkit
CSLU Toolkit
The Epos Speech Synthesis System
ISIP public domain speech to text system
- The ISIP Automatic Speech Recognition Toolkit
CSLU Toolkit (Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology)
Computer generation of accent marks
Spoken Natural Language Processing Group Software
CMU Error Analysis Toolkit
Audio Tools
VOICEBOX: Speech Processing Toolbox for MATLAB

Mathematical Software

NIST Guide to Available Mathematical Software

Statistics

Bayesian inference Using Gibbs Sampling
CoCo - A statistics package for analysis of associations between discrete variables.

Machine Learning

Machine Learning Toolbox (MLT)
The Machine Learning Programs Repository
The RIPPER rule learner
mFOIL - An ILP system designed to handle noisy examples.

Support Vector Machine

Information Retrieval & Filtering

seft - a Search Engine For Text
MG - Managing Gigabytes
Isearch - software for indexing and searching text documents.
SMART Software and test collections (Cornell University)
- see also SMART links
Doug Oard's Research Software Page - SMART Modifications
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering
ifile - A general mail filtering system.
IR-STAT-PAK - A program to compute descriptive and analytic statistics for the TREC IR trials.
Yavi - A visual interface to textual information.
Labeled data sets for information extraction

String/Pattern Matching

Online Approximate String Matching
Strmat package (exact string matching and suffix trees)

Sentence Boundary Detector

Clustering/Classification

WWW

w3mir - HTTP copying and mirroring tool.
HTTrack - The Web mirror utility.
HTML Conversion, Shareware and Freeware

Other Tools

German Morphology Browser (online service)
'mat2D' Matrix/Vector Library in C
Content Analysis Resources - for quantitative analyses of texts, transcripts, and images.
SNoW learning program
The µ-TBL Homepage - Logic Programming Tools for Transformation-Based Learning
ROOT: An Object-Oriented Data Analysis Framework
CAQDAS Networking Project - Computer Assisted Qualitative Data Analysis Software
Suffix sort
Nb - a graphical user interface for annotating the discourse structure of spoken dialogue, monologue, and text.
GATE - General Architecture for Text Engineering.
TiMBL: Tilburg Memory Based Learner
MtRecode - The Multext character translation program
Evalb - A bracket scoring program. It reports precision, recall, non crossing and tagging accuracy for given data.
The OC1 decision tree software system
IND Version 2.0 - creation and manipulation of decision trees from data
Paai's text utilities
Shoebox 3.0 for Windows and Macintosh - A database program oriented to the needs of a field linguist's dictionary.
Teaching materials for statistical NLP by Chris Brew, Language Technology Group, Human Communication Research Centre, University of Edinburgh
Introducing environmentalism and post-fordism into NLP (NeuroTran)
Tools for Estonian Language
Dan Melamed's Page - Simulated Annealing Program, XTAG morpholyzer post-processors for English Stemming, Good-Turing Smoothing Software, 150 miscellaneous text processing tools, 75 text statistics and bitext geometry tools.
TOOLDIAG: Pattern recognition toolbox
The DN2 Home Page - DN2 is an intelligent self-relating free format database system which accepts data in human text format, and retrieves it in response to human requests, like Where is London?
Software Announcements
Tools for drawing and graphically editing trees
Paul Nation's vocabulary programs
syllable prediction code (a simple lisp function)
Pratt - a pattern discovery tool
XGobi - A system for multivariate data visualization.
NODElib - Neural Optimization Development Engine library

Posted by 专心练剑, 2008-01-17 16:52

OpenSource: AI

专心练剑 — Fri, 14 Dec 2007 07:58:00 GMT

RapidMiner (YALE) -- Java Data Mining

RapidMiner (aka YALE): data mining, machine learning, knowledge discovery, and business intelligence in Java. 400+ operators for data mining (incl. Weka), learning, preprocessing, validation, and visualization. GUI, API, XML, analysis, knowledge discovery, databases, business intelligence.

Java Object Oriented Neural Engine

Joone is a neural net framework written in Java. It is composed of a core engine, a GUI editor, and a distributed training environment, and can be extended by writing new modules that implement new algorithms or architectures starting from base components.

Sesame

Sesame is a Java framework for storing, querying and inferencing for RDF. It can be deployed as a web server or used as a Java library. Features include several query languages (SeRQL and SPARQL), inferencing support, and RAM, disk, or RDBMS storage.

JGAP

JGAP is a Genetic Algorithms and Genetic Programming package written in Java. It is designed to require minimum effort to use, but is also designed to be highly modular. JGAP features grid functionality and a lot of examples. Many unit tests included.

Mandarax

Mandarax is a pure Java implementation of a rule engine. It supports multiple types of facts and rules based on reflection, databases, EJB, etc., and supports XML standards (RuleML 0.8). It provides a J2EE compliant inference engine using backward chaining.

OWL API

A Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API is focused towards OWL Lite, OWL DL and OWL 1.1 and offers an interface to inference engines and validation functionality.

Bayesian Network tools in Java (BNJ)

Bayesian Network tools in Java (BNJ) is an open-source suite of software tools for research and development using graphical models of probability. It is published by the Kansas State University Laboratory for Knowledge Discovery in Databases (KDD).

JAGA - Java API for Genetic Algorithms

Java API for implementing any kind of Genetic Algorithm and Genetic Programming applications quickly and easily. Contains a wide range of ready-to-use GA and GP algorithms and operators to be plugged-in or extended. Includes Tutorials and Examples.

RebeccaAIML, Enterprise AIML platform

RebeccaAIML is an enterprise, cross-platform, open source AIML development platform. RebeccaAIML supports C++, Java, C#, and Python, as well as many other programming languages, and AIML development out of the box with Eclipse.

Neural Network Utility

nn-utility is a neural network library for C++ and Java. Its aim is to simplify the tedious programming of neural networks, while allowing programmers to have maximum flexibility in terms of defining functions and network topology.

Algernon-J

Algernon is a rule-based reasoning engine written in Java. It allows forward and backward chaining across Protege knowledge bases. In addition to traversing the KB, rules can call Java functions and LISP functions (from an embedded LISP interpreter)

jason

Jason is a fully-fledged interpreter for an extended version of AgentSpeak, a BDI agent-oriented logic programming language, and is implemented in Java. Using SACI or JADE, a multi-agent system can be distributed over a network effortlessly.

nlpFarm

The nlpFarm is a Natural Language Processing (NLP) resource where early research prototypes (Java) can evolve into robust and useful open source. Our farmstead collaborates under the OpenNLP initiative, in order to make NLP software publicly available.

robotrader

Simulation platform for automated stock exchange trading. It delivers statistics to analyse performance on historic data and allows comparison between trading strategies, that can be coded in Java.

Posted by 专心练剑, 2007-12-14 15:58

JSP Coding Guidelines (Concise Version)

专心练剑 — Fri, 26 Oct 2007 05:39:00 GMT

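- The whole JSP/JSP bean presentation layer should be as thin and simple as possible.
- Remember that most JSPs should be read-only views, with a page bean providing the model.

- Design JSPs and JSP beans together.

- Wherever reasonable, move business logic out of JSPs. HTTP-specific logic (for example, cookie handling) belongs in beans or support classes, not in JSPs.

- Put conditional logic in the controller rather than in the view wherever possible.

- Use standard naming conventions for JSPs, included files, JSP beans, and classes that implement tag extensions. For example:
        JSP controller: xxxxController.jsp
        Included JSP: _descriptiveNameOfFragment.jsp
        JSP page model bean: xxxxBean, e.g. LoginBean.java
        JSP session bean: xxxxSessionBean
        Tag classes: xxxxTag, xxxxTagExtraInfo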

- Avoid the page import directive in JSPs. The import directive encourages instantiating classes rather than JSP beans:
Instead of: <%@ page import="java.util.*" %>
Use: <% java.util.List l = new java.util.LinkedList(); %>

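- A JSP should not access request parameters directly. A bean should do that processing and expose the processed model data.

- A JSP should not access properties files or use JNDI. Beans may access properties.

- If not all of a JSP bean's properties can be mapped from the page request, try to set the properties in the tag.

- Avoid designing pages that both display a form and process its results.

- Avoid code duplication in JSPs. Put repeated functionality in an included JSP, a bean, or a tag extension so that it can be reused.

- JSP beans should never generate HTML.

- Avoid using out.println() in a JSP to generate page content.

- The JSP layer should not access data directly; this includes JDBC database access and EJB access.

- Scriptlets should preferably be no longer than 5 lines.

- Apart from JSP beans, a JSP should not instantiate complex read-write objects; doing so makes it possible to perform inappropriate business logic in the JSP.

- JSP beans should not contain large amounts of data.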

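- If you use include and forward actions and must use simple-typed values to communicate with the external page, use one or more parameter elements.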

- Use custom tags where they appropriately move logic out of the JSP.

- Use the forward tag with caution; in a JSP it is the equivalent of a goto.

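- Use hidden comments to keep the generated HTML from growing too large.

- Avoid exception handling in JSPs.

- Every JSP file should use an error page to handle exceptions it cannot recover from.

- In a JSP error page, use HTML comments to show the exception trace passed to the page.

- Use the jspInit() and jspDestroy() methods only when they yield a performance benefit. Acquiring and releasing resources is the job of JSP beans and tag handlers, not of the JSP.

- Do not define methods or inner classes in a JSP without good reason.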

Posted by 专心练剑, 2007-10-26 13:39

JSP Coding Guidelines

专心练剑 — Fri, 26 Oct 2007 05:34:00 GMT

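Summary:
Steady progress in the JSP specification, a growing number of available JSP development tools, and the expanding range of areas that JSP technology touches have all promoted the development of highly maintainable, standardized web applications based on JSP. This article discusses some of the main developments in JSP and how they make it easier to develop robust JSP web applications.
The best practices in this article will help you apply the power of JSP and prepare for future JSP upgrades.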

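The JSP specification supports both JSP pages and JSP documents. The main difference between them is their degree of XML compliance. JSP pages use the traditional, or "shorthand", syntax, while JSP documents use syntax that is fully XML-compliant. JSP documents are sometimes referred to as JSP pages that use XML syntax, but here I refer to them as JSP pages and JSP documents to keep the two distinct.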

I recommend JSP documents for the following reasons:
- You can easily verify JSP documents as well-formed XML/HTML.
- You can validate JSP documents against an XML Schema.
- You can easily write and parse them with standard XML tools.
- You can use XSLT (Extensible Stylesheet Language Transformations) to render JSP documents in different forms; see "JSP documents with XSLT" http://www.javaworld.com/javaworld/jw-07-2003/jw-0725-morejsp.html?#sidebar1
- JSP documents use XML-compliant include and forward actions and custom tags, so the whole document is XML-compliant, which improves coding consistency.
JSP documents demand slightly more development discipline than JSP pages, but the payoff is documents that are easier to read and maintain, especially for people who are just starting to learn JSP.
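The well-formedness check mentioned in the list can be sketched with the XML parser that ships with the JDK. This is only an illustration under stated assumptions: the class name JspDocumentCheck and the method isWellFormed are hypothetical, and the check covers XML well-formedness only, not JSP semantics or schema validity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether JSP source parses as well-formed XML.
class JspDocumentCheck {
    static boolean isWellFormed(String jspSource) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(jspSource)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String jspDocument = "<jsp:root xmlns:jsp=\"http://java.sun.com/JSP/Page\" version=\"1.2\">"
                + "<p>hi</p></jsp:root>";
        String jspPage = "<%@ page import=\"java.util.*\" %><p>hi</p>";
        System.out.println(isWellFormed(jspDocument));
        System.out.println(isWellFormed(jspPage));
    }
}
```

A JSP page in shorthand syntax fails this check because the <%@ ... %> directive is not XML, which is exactly why standard XML tools cannot be applied to JSP pages directly.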

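For details on creating JSP documents and their features, see "Write JSPs in XML Using JSP 1.2" (http://www.javaworld.com/javaworld/jw-07-2003/jw-0725-morejsp.html?#resources).
The biggest drawback of JSP documents is that there is no XML-compliant JSP comment. JSP documents can use client-side comments (HTML/XML style) or embedded Java comments, but they have no equivalent of the JSP <%-- --> comment, and both of the available approaches have drawbacks. Client-side comments are visible in the resulting web page (through the browser's "View Source" feature), and Java comments require writing Java code directly in the JSP document.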

In the rest of this article, I use "JSPs" to mean both JSP pages and JSP documents, because the best practices I discuss apply equally to both forms.

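Use JSP coding conventions
Whatever the language and whatever the project, following coding standards and conventions is a wise choice for improving the development, maintenance, and testing of your software. Reading other developers' code is neither simple nor pleasant, but if all developers follow the same naming conventions and other agreed rules, reading and maintaining code becomes easier both for others and for the original programmers.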

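Sun Microsystems has recently helped organizations establish such conventions by making the document "Code Conventions for the JavaServer Pages Technology Version 1.x Language" freely available; see Resources (http://www.javaworld.com/javaworld/jw-07-2003/jw-0725-morejsp.html?#resources). If your company does not yet follow JSP coding conventions, I suggest using this document as a starting point. You can adopt it completely or build your own conventions on top of it.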

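Choose the appropriate scope for objects
The JSP specification supports four scopes (application, session, request, and page), and you can choose one of them for each object you create in your JSPs. Because objects bound to these scopes consume memory and sometimes need to be released, it is best to choose the scope that fits the task.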

Application scope
Application scope is the broadest scope and should be used only when necessary. You can create objects bound at application level in JSPs that are not session-aware, so application scope is useful for storing information when using these types of JSPs. You can also use application-bound objects to share data between different sessions. When you no longer need application-scoped objects, be sure to remove them explicitly to free memory.

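Session scope
In my experience, session scope is used more than application scope. Session scope lets you create objects and bind them to a session. You must create session-bound objects in session-aware JSPs, and they are accessible to all JSPs and servlets in the same session. Session scope is commonly used for managing security credentials and for managing state across multiple pages. Objects bound to session scope should also be removed explicitly when they are no longer needed. When I plan to bind objects of a class to session scope, I usually make that class serializable.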
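The text does not say why session-bound classes are made serializable; a common reason, assumed here, is that a container may persist or replicate sessions. The sketch below simulates that with a plain serialization round trip; LoginBean and SessionBeanRoundTrip are hypothetical names, and no servlet API is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SessionBeanRoundTrip {
    // Hypothetical bean of the kind one might bind to session scope.
    static class LoginBean implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String userId;

        LoginBean(String userId) {
            this.userId = userId;
        }

        String getUserId() {
            return userId;
        }
    }

    // Writes the bean to bytes and reads it back, as a container might when storing a session.
    static LoginBean roundTrip(LoginBean bean) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(bean);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LoginBean) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new LoginBean("MX787")).getUserId());
    }
}
```

A class that does not implement Serializable would fail in writeObject with a NotSerializableException, which is the failure this habit is meant to avoid.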

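Request scope
I use request scope most often when binding objects. Such objects are valid only among the pages handling the same request. They are released automatically when request processing completes, so there is no need to release them explicitly, and no danger of burdening the system with unnecessary memory consumption.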

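Page scope
Choose page scope when you create objects that are relevant only to the current page. As with request scope, objects bound to page scope do not need to be removed explicitly. I rarely use page scope in my JSP applications, but it is the default scope.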

Which scope to choose
Choose the scope of created objects carefully to use memory efficiently. I usually start with request scope and then evaluate whether a wider scope is needed.

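Manage session scope carefully
As mentioned above, choose session scope only when necessary, and explicitly remove objects from session scope when they no longer need session-level access. In JSPs that do not use session-scoped objects, you can set the page directive's session attribute to false to avoid session management. However, few web applications can do without session scope; I typically use sessions to support security mechanisms and other application needs. Although a session expires after a period that you can configure, it is better to explicitly remove session-scoped objects when they are no longer needed than to rely on automatic session expiry.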

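Adopt JSTL (the standard tag library)
The introduction and adoption of JSTL has been one of the most important advances for JSP developers. JSTL stands for "JSP Standard Tag Library"; the T in JSTL stands for Tag, not Template.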

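JSTL: Background and overview
In my previous article, I recommended that JSP developers adopt available custom tag libraries rather than create their own from scratch. Many commercial and open source custom tag libraries are available. One downside, however, is that developers must apply these tags in their JSPs in the format specific to each library. The advent of JSTL has addressed this downside by providing standard interfaces to the custom tags that perform many basic functions JSP developers need. Different vendors may implement the JSTL tags differently, but JSP developers do not need to know about the implementation differences. A JSP page or JSP document written with JSTL should work with all JSTL implementations.
Many valuable books and online resources are available for learning JSTL. Here I only briefly introduce JSTL's advantages and features.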

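Advantages of JSTL
In short, JSTL offers all the benefits of published custom tag libraries while also providing a standardized tag API. JSTL promotes highly maintainable and portable pages and documents. Some of its specific features are listed here:
- JSTL provides tag-based iteration, conditionals, and other functionality that previously was implemented by embedding code directly in JSPs, by using home-grown tags or nonstandard tag libraries, or by using servlets in place of JSPs.
- JSTL uses EL (expression language) syntax.
- Writing custom tags takes more effort and experience than most other JSP development tasks. JSTL eases this in two ways: first, as noted above, JSTL handles many common needs for custom tags; second, JSTL provides mechanisms that make it easier to write your own custom tags, especially tags that support EL.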

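Specific JSTL features and advantages
The following briefly summarizes the advantages of three of JSTL's four custom tag libraries and explains why I do not recommend the database access library. I also discuss the advantages of using EL.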

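Database access library
JSTL provides a database access tag library, but I rarely use it because I feel strongly that databases should not be accessed directly from JSPs. Accessing the database directly in a JSP reduces reuse, because the database access code is not available outside the JSP that uses the database access tags. Direct database access in JSPs also increases coupling between the presentation layer and the data layer. Disciplined separation means more modularity, greater opportunity for reuse, and better opportunities for specialization of presentation and database experts. I recommend the other three JSTL tag libraries wherever they meet JSP developers' needs, but I do not recommend the JSTL database access library outside of prototypes and the simplest web applications.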

JSTL core tag library

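Take advantage of servlet filters
Servlet filters were introduced in the Servlet 2.3 specification, but they also benefit JSP development and maintenance. Because JSPs are translated into servlets, JSPs are closely tied to servlet technology, so it should be no surprise that important developments in the servlet specification affect JSP development.
Servlet filters are the J2EE implementation of the Intercepting Filter pattern and so provide that pattern's benefits, including better maintainability, less code redundancy, and better portability. Services that you would otherwise need to add to every JSP can instead be placed in a filter, and the JSPs need not even be aware that the filter exists. Because pluggable filters and JSPs are not coupled, changes to a filter do not directly affect the JSPs. You can also chain filters, combining different filters that each serve a different purpose.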

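The role of servlet filters in JSP web applications
The following two examples illustrate the role of servlet filters in JSP-based web applications. In many security configurations, every JSP checks the session ID and other security credentials to authorize a call to that JSP. You can move this checking code out of each JSP into a servlet filter and make sure the filter is invoked before each JSP. This improves the maintainability and portability of the JSPs: security-check changes, or new security-related code, go into this one filter rather than into every JSP. If the whole security mechanism changes in the future, the filter is the only place that needs to change; the individual JSPs need no modification.
In the previous article, "JSP Best Practices", I recommended storing exception information in secondary storage and only providing the user with an identifier to search the storage for the entire exception trace. Servlet filters are very useful here. You can configure the web application to automatically run an exception-logging filter whenever the exception JSP is invoked. The Servlet specification suggests many other potential uses of servlet filters.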
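The Intercepting Filter idea behind the security example can be sketched in plain Java without the servlet API. Everything below is a hypothetical illustration: the Filter, Page, and Chain types are invented for this sketch and only mimic the shape of javax.servlet.Filter and FilterChain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InterceptingFilterSketch {
    // Hypothetical stand-ins for a servlet filter and the JSP it protects.
    interface Filter {
        void doFilter(Map<String, String> request, List<String> trace, Chain chain);
    }

    interface Page {
        void render(Map<String, String> request, List<String> trace);
    }

    // Passes the request to each filter in order, then to the target page.
    static final class Chain {
        private final List<Filter> filters;
        private final Page target;
        private int next = 0;

        Chain(List<Filter> filters, Page target) {
            this.filters = filters;
            this.target = target;
        }

        void proceed(Map<String, String> request, List<String> trace) {
            if (next < filters.size()) {
                filters.get(next++).doFilter(request, trace, this);
            } else {
                target.render(request, trace);
            }
        }
    }

    static List<String> handle(Map<String, String> request) {
        List<String> trace = new ArrayList<>();
        // Security check lives in one filter instead of in every page.
        Filter security = (req, t, chain) -> {
            if (req.get("sessionId") == null) {
                t.add("denied");
                return; // stop here: the page is never reached
            }
            t.add("authorized");
            chain.proceed(req, t);
        };
        // A second filter shows chaining; it wraps the call to the page.
        Filter audit = (req, t, chain) -> {
            t.add("audit:before");
            chain.proceed(req, t);
            t.add("audit:after");
        };
        Page page = (req, t) -> t.add("page rendered");
        new Chain(Arrays.asList(security, audit), page).proceed(request, trace);
        return trace;
    }

    public static void main(String[] args) {
        Map<String, String> request = new HashMap<>();
        request.put("sessionId", "abc123");
        System.out.println(handle(request));
        System.out.println(handle(new HashMap<>()));
    }
}
```

The page itself contains no security code, so changing the security policy means editing only the one filter, which is the maintainability benefit described above.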
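Document the APIs for your JSPs
One of Java's many pleasing features is its support for Javadoc, which makes it quick and easy to produce web-based documentation for Java code. Unfortunately, the javadoc tool does not support JSPs, and the JSP specification does not call for a way to provide "JSP APIs".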

What is a JSP API?
It is very useful to be able to quickly determine certain aspects of a JSP without reading all of its code. For example, you may need to know which variables are bound to session, request, and application scope, and exactly which scope each is bound to. Another example of useful JSP API information is denoting in JSP segments which variables they require the calling JSP to have declared and defined when including them.

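The JSP specification does not address how to document JSP APIs. Sun's JSP 1.x code conventions document discusses placing comments with author, copyright, and description information at the top of JSPs, but I like to document my JSPs' expected inputs more thoroughly.
Because the JSP specification does not cover this, there is no standard way to document a JSP API. One approach is to use Java code (scriptlets) in the JSP and embed Javadoc-style comments (/** javadoc comment */) in that code. Although I rarely use Java code in JSPs, this is the simplest way to keep such comments on the server side. Using XML/HTML-style comments would expose the JSP API to the client, which is a poor approach.
I know of two freely available products for documenting your JSPs: JspDoc from SourceForge.net and JSPDoc from Freshmeat.net of OSDN (Open Source Development Network). (For details on both tools, see Resources: http://www.javaworld.com/javaworld/jw-07-2003/jw-0725-morejsp-p3.html#resources.) I briefly introduce both tools here.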

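JspDoc (SourceForge)
SourceForge's JspDoc generates Javadoc-style documentation for JSPs. It works by having you place XML-compliant tags inside Javadoc-style comments (/** */) located in the JSP page's Java code. The tool's drawback is that it currently supports only JSP pages, although support for JSP documents is on its to-do list.
The tool also provides a feature for converting JSP pages to JSP documents. Because I have written JSP documents from the start, I have not used this feature, but it should be a good tool for users who want to move from JSP pages to JSP documents. Another feature converts JSP documents to JSP pages.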

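JSPDoc (Freshmeat.net)
Freshmeat's JSP documentation generator, JSPDoc, extracts information from JSPs to create Javadoc-style web-based documentation pages. One advantage of this tool is that it can combine the generated JSP documentation with documentation for Java classes produced by the Javadoc tool. A drawback is that it requires a fairly strict comment structure to generate the documentation. This special syntax uses Javadoc's /** */ but does not recognize the @ symbol, which has a specific meaning in standard Javadoc. Another drawback is that the tool does not support XML-compliant JSP documents and instead requires the <% %> syntax. This product is available under the Mozilla public license.

JSP documentation for JSP documents
Because neither JspDoc nor JSPDoc supports JSP documents, I take advantage of the XML compliance of JSP documents to generate Javadoc-style documentation. With an XSLT stylesheet, it is easy to build HTML documentation pages for JSP documents, and no custom parsing is required, because when your JSP is a well-formed XML document, standard tools (such as Xalan) can do this processing.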
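As a rough illustration of this XSLT approach, the sketch below lists the beans a JSP document declares together with their scopes. It is not the author's stylesheet: the stylesheet, the class name JspDocApiExtractor, and the sample bean classes are all hypothetical, the output is plain text rather than HTML, and the JDK's built-in XSLT processor is assumed in place of a separately installed Xalan.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical extractor: prints "id : scope" for every jsp:useBean in a JSP document.
class JspDocApiExtractor {
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:jsp='http://java.sun.com/JSP/Page'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//jsp:useBean'>"
        + "<xsl:value-of select='@id'/><xsl:text> : </xsl:text>"
        + "<xsl:value-of select='@scope'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String listBeans(String jspDocument) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(jspDocument)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String jspDocument =
              "<jsp:root xmlns:jsp=\"http://java.sun.com/JSP/Page\" version=\"1.2\">"
            + "<jsp:useBean id=\"cart\" scope=\"session\" class=\"shop.Cart\"/>"
            + "<jsp:useBean id=\"query\" scope=\"request\" class=\"shop.Query\"/>"
            + "</jsp:root>";
        System.out.print(listBeans(jspDocument));
    }
}
```

Because the input is ordinary XML, the same transformation could emit HTML by switching the output method and wrapping the values in markup, with no JSP-specific parser involved.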

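When you enter a URL in a browser to execute a JSP, the JSP goes through a series of processing steps before it is delivered to the user as HTML. Because of this processing, the first request for a JSP takes much longer than subsequent requests for the same page. Many developers know the importance of precompiling JSPs for deployment; precompiling during development is just as useful.
You can precompile JSPs at build time, while compiling the JavaBeans, custom tag handler classes, other related classes, and servlets associated with them. This way everything is compiled in a single pass, which reduces the time spent waiting on compilation at any one moment. This matters for developers, who are easily distracted while waiting for a compile, so compiling everything at once is preferable to compiling each JSP only when it is first requested.
Precompiling can uncover parser problems and other translation-time problems, which otherwise often take several steps to locate. This is valuable for developers, who no longer need to navigate through multiple pages to find the one with the problem. If you use JSP documents, precompilation can also verify the structure of the JSP documents.
Another benefit of precompiling is that the deployed WAR file can contain compiled versions of your JSPs rather than the actual JSP source. Once compiled, the JSPs can be included in the deployed product as .class files (whose names follow the container vendor's naming conventions).
Most Java 2 Platform and J2EE tools, and some other Java tools, support JSP precompilation, and professional web containers support it as well, although possibly through a nonstandard command or interface. Many web containers support command-line JSP precompilation, which you can add to your scripted builds.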

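Organize files and directories
The following techniques help make JSP development and maintenance easier and more efficient:
- Organize the web root directory
- Organize the WEB-INF directory, making sensible use of subdirectories
- Use the .jspf extension to identify JSP fragments (JSP files meant to be included in other JSPs; translator's note)
- Use IDEs, Ant, and other automation tools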

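Organize the web root directory
You can place all of a web application's files directly in the web root directory, the directory that contains the WEB-INF directory. I recommend organizing this directory sensibly, for example by adding subdirectories such as jsp, html, and css. Whether such a split is needed for simple applications is debatable, but for large web applications it improves comprehension and maintainability.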

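Organize the WEB-INF directory
Tag libraries are a valuable resource in JSP development. A large web application may include several tag libraries, such as JSTL, the Struts tag libraries, and others. I recommend creating a tld subdirectory under WEB-INF to hold these tag libraries rather than placing them directly in WEB-INF, where they could "swamp" the directory.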

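Use the .jspf extension to identify JSP fragments
JSP segments, as recent versions of the JSP specification call them (earlier versions called them JSP fragments), are incomplete JSPs meant to be included by other JSPs. The JSP specification suggests using a naming convention to distinguish "top-level" JSPs from JSP fragments/segments. The usual convention gives complete top-level files the .jsp extension and JSP fragments/segments the .jspf extension, although the specification does not require this. I also recommend placing complete top-level JSPs in a different directory from the fragments.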

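Use IDEs, Ant, and other automation tools
IDEs can speed up development and deployment and reduce typing and other errors. Many IDEs provide J2EE tools and wizards, and they also integrate with frameworks and libraries (such as Struts and the JSTL tag libraries).
Ant is the de facto standard tool for building and deploying Java and J2EE applications. Ant offers many useful features for building and deploying applications, including support for creating and deploying WAR and EAR files. Many tools have built-in Ant support. When I cannot use an IDE, I consider Ant indispensable. Other tools may also support automated builds and deployment and offer features similar to Ant's, but two of Ant's biggest advantages are its cost (free) and its wide support.
I also recommend Apache Maven, a useful product when it comes to managing an entire Java project.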

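Reconsider non-specification-compliant features
Web servers occasionally provide vendor-specific features that can be very useful in development, improving performance, security, and other characteristics. In some cases using these vendor-specific features is justified because the benefits far outweigh the potential risks. However, you should be aware of the risks of using vendor-specific features and, other things being equal, prefer specification-compliant features. Keep in mind that not every feature is called for by a specification; in those cases, every vendor's implementation is proprietary.
Technical dependency is not always what makes a vendor-specific feature risky. A custom tag library supplied by a particular web server vendor may work on any web server that supports custom tags, in which case what you need to watch for is licensing issues.
The best practice depends on how likely you are to change web servers. When I do not use Tomcat as my web server, I usually also deploy the J2EE-based web application on other web servers to verify specification compliance. Remember that even if you always use the same web server, vendor-specific features carry risk over time. As the J2EE specification evolves, it may define in a standard way a feature that a particular vendor implemented in its own way, and the vendor will then move to the common standard.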

Use XHTML syntax
In "JSP Best Practices", I recommended applying HTML best practices in JSPs. Going further, I now find that the XHTML specification offers the most useful version of HTML tag syntax in authoring JSP documents. XHTML makes it easier to create XML-compliant JSP documents, and even authors of JSP pages find benefits in using XHTML in their JSPs.
Because it is fully XML-compliant, XHTML syntax follows stricter rules than HTML. For the differences between standard HTML and XHTML tags, see the World Wide Web Consortium's XHTML 1.0 pages (http://www.javaworld.com/javaworld/jw-07-2003/jw-0725-morejsp-p4.html#resources).

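It only gets better
JSP technology was designed to simplify flexible web development, and the recent arrival of JSTL continues that trend. Even advances in the servlet specification have made JSP development much easier. Advances in the JSP and servlet specifications, the emergence of new tools, and the sharing of JSP coding standards all make it easier than ever to develop highly maintainable JSPs.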

Posted by 专心练剑, 2007-10-26 13:34