Thread Background/Root http://forums.infoworld.com/WebX?50@152.MOElaoRzA4t.1@.ee7680f

Data dilemma Live forum event Thursday, Oct. 31, at 10 a.m. PST

11/01/02 11:53 am 1ca

Managing unstructured data brings up a host of tech concerns: content management, content delivery, search, knowledge management, document management, and more. However, simply finding the information you want and need often seems to be the toughest task, especially when it comes to unstructured data. A slew of technologies and vendors are emerging with possible solutions to the unstructured data management problem -- from XML-based standards and souped-up search technology to content management solutions that do a deeper categorization of data types and relationships. How big a problem is unstructured data management in your company, and how are you approaching it? Which technologies are you relying on to retrieve the information you need? Managing unstructured data may mean hard work on the database and data management side, but many companies feel the strategic advantage found in that data, not to mention improved efficiency, makes that effort worthwhile.

Read the special report "Diving into data" (available online Oct. 25), then join the live forum event here with Hadley Reynolds, director of research at Delphi Group, on Thursday, Oct. 31, at 10 a.m. PST.

Readers are encouraged to post their questions and comments to the guest in advance.

========================= Unstructured 2 semistructured 2 structured

This unstructured data management (UDM) to semi-structured data management (SSDM) to structured data management (SDM) issue has engaged me since my corporate online documentation/training efforts from 1985 through 1989, using hypermedia (e.g., BlackMagic, Hyperwriter, pre-Web hyperlinks and hypermedia authoring) and SGML (tagged online document content for navigation among topics and references). I categorize UDM as raw text and office automation artifacts (word processing, spreadsheets, presentations, diagrams, etc.). I categorize SSDM as lightly tagged XML documents and data sets, with or without a DTD/Schema. I categorize SDM as SQL and highly tagged XML with a DTD/Schema.

I was doing this initial intelligence transfer while serving in an additional duty as a "CIO" equivalent for a large military organization, in conjunction with my formal Comptroller-related "enterprise management" support duties. Those duties included restructuring and rejustifying the organization's mission and composition (and thus its documents for functional assignments, references/guidance, and plans), and implementing a mission-based, architecture-driven WAN/LAN environment with shared information resources such as individual/group/corporate email, calendars, file systems, databases, and directories.

I then took this approach, i.e., a knowledge-based methodology for enterprise spiral life cycle management, enterprise architecture management, and enterprise resource/requirement management, to the next higher headquarters from 1989 until 1992. There I had more success in implementing the UDM to SSDM to SDM effort, with all of the content going into a multi-dimensional database of my design, a superset of what is now implemented in repositories compliant with the Object Management Group's (OMG) Meta-Object Facility (MOF).

This was also part of my effort in helping with a client's major system/software documentation efforts (over 5000 pages of complex technical specifications, user manuals, etc.). The biggest challenge was getting the technical editors, much less the content authors, to invest in much more than using styles, which was a dramatic improvement over their prior use of little more than text formatting. They did not see significant value in tagging the content for reuse/reference/intelligence use.

The next biggest hurdle I found was that most content authors do not understand that pretty-styled documents, presentations, or spreadsheets are useless as intelligence for operations and access control unless the internal contents/data are tagged with descriptive and unique-identity "containers/metadata", all the way up to the document root, which itself is given a unique "namespace" identity and location/path ("carriers/connectors") and has descriptive metadata. This was when I started using the simplification of describing IT and information in terms of "content, containers, and carriers", or data, metadata/platforms, and networks respectively.

This unstructured content tagging has to start at the word and sometimes character level, extending up to paragraphs (with an outline-number attribute) and exhibits (diagrams or tables, each with their own tag structures, including attributes for the exhibit's outline-numbered caption), within sections (with tags for section-specific headers and footers), up to the root container itself (i.e., the document, presentation, spreadsheet, etc., including tags for the Table of Contents, Index/concordance, Tables of Exhibits, and other structures, along with descriptive attributes).
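To make that nesting concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The tag names, attribute names, namespace URI, and sample sentence are all my own placeholders for illustration, not any particular product's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace identity; any unique, stable identifier would do.
NS = "urn:example:enterprise:docs"


def build_document() -> ET.Element:
    """Build a tiny deeply tagged document: root > section > paragraph > keyword."""
    # Root container: unique identity, location/path, descriptive metadata.
    root = ET.Element("document", {
        "id": f"{NS}:manual-001",
        "path": "/docs/manuals/manual-001.xml",
        "title": "Operations Manual",
    })
    section = ET.SubElement(root, "section", {"id": "s1", "header": "Overview"})
    # Paragraph carries an outline-number attribute.
    para = ET.SubElement(section, "paragraph", {"outline": "1.1"})
    para.text = "The "
    # Word-level tag with its own unique identity.
    kw = ET.SubElement(para, "keyword", {"role": "noun", "id": "kw-comptroller"})
    kw.text = "comptroller"
    kw.tail = " approves the budget."
    # Exhibit with an outline-numbered caption.
    ET.SubElement(section, "exhibit", {
        "outline": "1.2", "kind": "table", "caption": "Budget by unit",
    })
    return root


def container_path(node, target_id, trail=()):
    """Return the chain of container labels from the root down to target_id."""
    label = node.get("id") or node.get("outline") or node.tag
    here = trail + (label,)
    if node.get("id") == target_id:
        return list(here)
    for child in node:
        found = container_path(child, target_id, here)
        if found:
            return found
    return None


doc = build_document()
path = container_path(doc, "kw-comptroller")
paragraph_text = "".join(doc.find("./section/paragraph").itertext())
```

Here `path` walks from the root's namespace identity through the section and the paragraph outline number down to the tagged word, which is the "all the way up to the document root" chain described above, read top-down.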

This is labor intensive, unless tools can automatically identify keywords (and their morphology as nouns/objects and verbs/relations/actions), keyword clauses (noun/verb/noun combinations), and unique phrases and tag them, identify structures/containers and tag them, and identify root location/identity and tag that, all from a generalized schema for that type of container (word processing document, presentation, spreadsheet, etc.). These tools are available now, but are still expensive or fragmented. We'll see what the next year or so provides.
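As a toy illustration of the automatic keyword and clause tagging idea (nowhere near what a commercial tool does), here is a sketch in Python. The vocabulary, the `kw` tag name, and the `role` attribute are assumptions of mine; a real tool would get morphology from a lexicon or parser rather than a hand-made table.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical vocabulary mapping keywords to their morphology.
VOCAB = {"comptroller": "noun", "approves": "verb", "budget": "noun"}


def tag_keywords(text):
    """Wrap known keywords in <kw> tags; return (tagged_text, keyword_roles)."""
    roles = []
    out = []
    # Splitting on a capturing group keeps both the words and the separators.
    for part in re.split(r"([A-Za-z]+)", text):
        role = VOCAB.get(part.lower())
        if role is None:
            out.append(escape(part))
        else:
            roles.append((part.lower(), role))
            out.append(f'<kw role="{role}">{escape(part)}</kw>')
    return "".join(out), roles


def find_clauses(roles):
    """Return noun/verb/noun triples among consecutive tagged keywords."""
    clauses = []
    for i in range(len(roles) - 2):
        window = roles[i:i + 3]
        if [r for _, r in window] == ["noun", "verb", "noun"]:
            clauses.append(tuple(w for w, _ in window))
    return clauses


tagged, roles = tag_keywords("The comptroller approves the budget.")
clauses = find_clauses(roles)
```

Note that the clause finder only looks at the sequence of tagged keywords and ignores untagged words in between, which is a deliberate simplification.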

This absolutely has to be done, because the recorded "intelligence" of the enterprise is constantly being refreshed by its active agents (people, processes, devices) and stored in a mixed bag of UDM, SSDM, and SDM. To make it usable for dynamic operations and resource access, it has to be moved into at least SSDM, and for efficient and effective use, into SDM.

============================== Private, Proprietary, Classified Information All Require deep tagging in a deep namespace

The same issues that arise with privacy requirements arise with the need to protect proprietary information (e.g., copyrights, trademarks, patents, trade secrets, competitive intelligence) and national/military classification (e.g., classification level, access boundaries, access rights). From my experience and analysis, unless you can provide a mechanism for contextual mission- and knowledge-based designation (as attributes) of deeply tagged content, you cannot control access-to, presentation-of, and interaction-with private, proprietary, or classified intelligence.

This designation would have to encompass the user's relevant mission-assignment/role to a location, organization, organization unit, function, and process that justifies the requirement for some access and some right to that tagged intelligence resource. This is essentially a role-based access control (RBAC) requirement.
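A rough sketch of what such a check could look like, in Python. The class names, the three-level ordering, the `clearance` field, and the sample values are hypothetical choices of mine for illustration; a real deployment would take these from its own classification scheme and directory.

```python
from dataclasses import dataclass

# Hypothetical ordering of sensitivity levels, lowest to highest.
LEVELS = ["public", "proprietary", "classified"]


@dataclass
class Assignment:
    """A user's mission assignment/role, per the management attributes above."""
    location: str
    organization: str
    unit: str
    function: str
    process: str
    clearance: str          # assumed: highest level this role may see
    rights: frozenset       # assumed: e.g. {"read", "write"}


@dataclass
class TaggedContent:
    """A deeply tagged piece of content with its access designation."""
    content_id: str
    level: str
    context: dict           # required attribute values, e.g. {"function": "budgeting"}


def may_access(a: Assignment, c: TaggedContent, right: str) -> bool:
    """Allow only if the role holds the right, the level, and the context."""
    if right not in a.rights:
        return False
    if LEVELS.index(a.clearance) < LEVELS.index(c.level):
        return False
    # Every attribute the content designates must match the assignment.
    return all(getattr(a, key) == value for key, value in c.context.items())


analyst = Assignment("hq", "org-a", "unit-1", "budgeting", "planning",
                     "proprietary", frozenset({"read"}))
para = TaggedContent("manual-001/s1/1.1", "proprietary",
                     {"function": "budgeting", "unit": "unit-1"})
secret = TaggedContent("manual-001/s2/2.4", "classified", {"function": "budgeting"})
```

The point of the sketch is that the decision is made per tagged item (here a single paragraph), not per file, which is only possible if the content has been tagged that deeply in the first place.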

=============================== A definition of Intelligence

With regard to the above post, I define "intelligence" as the relevant collection of the following contextually framed elements:

monitored current Situation status (categorized by the management attributes of location, organization, organization unit, function, process, resource, and requirement life cycle state),

situation change Event,

event Signals,

signals framed as Data/alert,

Information as data in context (i.e., data content in its metadata container and carrier, categorized by its management attributes),

Knowledge as information in context (i.e., information categorized by its management attributes),

Awareness as knowledge in context (i.e., subject knowledge customized for the user's management attributes),

Wisdom as awareness in context (i.e., user's awareness of the monitored situation and its events, signals, data, information, and knowledge) as the basis for situational decision making,

New-Situation status, categorized by its management attributes, resulting from the change event,

the logged transactional History (i.e., past intelligence) of the monitored situation, and

the recorded Future contingencies/extrapolations/plans of the monitored situation.
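
The "X as Y in context" steps in the middle of that list (data, information, knowledge, awareness, wisdom) can be sketched as successive framing. This is only one way to model it; the class name, function name, and sample context values below are my own placeholders.

```python
from dataclasses import dataclass, field

# The layers named in the list above, each one the previous layer in context.
LAYERS = ["data", "information", "knowledge", "awareness", "wisdom"]


@dataclass
class Framed:
    """An item at some layer, wrapping the lower-layer item it was built from."""
    layer: str
    payload: object
    context: dict = field(default_factory=dict)


def add_context(item: Framed, context: dict) -> Framed:
    """Promote an item one layer by framing it in additional context."""
    i = LAYERS.index(item.layer)
    if i == len(LAYERS) - 1:
        raise ValueError("already at the top layer")
    return Framed(LAYERS[i + 1], item, dict(context))


# Hypothetical example values throughout.
data = Framed("data", "budget overrun alert")
info = add_context(data, {"container": "ledger.xml", "carrier": "/finance"})
knowledge = add_context(info, {"function": "budgeting", "process": "planning"})
awareness = add_context(knowledge, {"user_unit": "unit-1"})
wisdom = add_context(awareness, {"situation": "quarter-end review"})
```

Each promotion keeps the lower layer intact inside the new one, so the original signal and every context added along the way remain recoverable for the transactional history.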