Chapter 7. Text Tables

Text Tables as a Standard Feature of HSQLDB

Bob Preston

The HSQL Development Group

Fred Toussi

The HSQL Development Group

$Revision: 6692 $

Copyright 2002-2024 Bob Preston and Fred Toussi. Permission is granted to distribute this document without any alteration under the terms of the HSQLDB license. Additional permission is granted to the HSQL Development Group to distribute this document with or without alterations under the terms of the HSQLDB license.

2024-10-25

Table of Contents

Overview
The Implementation
Definition of Tables
Scope and Reassignment
Null Values in Columns of Text Tables
Configuration
Disconnecting Text Tables
Text File Usage
Text File Global Properties
Transactions

Overview

Text Table support for HSQLDB was originally developed by Bob Preston independently from the Project. Subsequently Bob joined the Project and incorporated this feature into version 1.7.0, with a number of enhancements, especially the use of SQL commands for specifying the files used for Text Tables.

In a nutshell, Text Tables are CSV or other delimited files treated as SQL tables. Any ordinary CSV or other delimited file can be used. The full range of SQL queries can be performed on these files, including SELECT, INSERT, UPDATE and DELETE. Indexes and unique constraints can be set up, and foreign key constraints can be used to enforce referential integrity between Text Tables themselves or with conventional tables.

The delimited file can be created by the engine, or an existing file can be used.

HyperSQL with Text Table support is the only comprehensive solution that employs the power of SQL and the universal reach of JDBC to handle data stored in text files.

The Implementation

Definition of Tables

Text Tables are defined in a similar way to conventional tables with the added TEXT keyword.

 CREATE TEXT TABLE <tablename> (<column definition> [<constraint definition>])

The table is at first empty and cannot be written to. An additional SET command specifies the file and the separator character that the Text table uses. It assigns the file to the table.

 SET TABLE <tablename> SOURCE <quoted_filename_and_options> [DESC]

Scope and Reassignment

  • A Text table without a file assigned to it is READ ONLY and EMPTY.

  • Reassigning a Text Table definition to a new file has implications in the following areas:

    1. The user is required to be an administrator.

    2. Existing transactions are committed at this point.

    3. Constraints, including foreign keys referencing this table, are kept intact but not checked. It is the responsibility of the administrator to ensure their integrity.

    The new source file is scanned and indexes are built when it is assigned to the table. At this point any violation of NOT NULL, UNIQUE or PRIMARY KEY constraints are caught and the assignment is aborted. However, foreign key constraints are not checked at the time of assignment or reassignment of the source file.

Null Values in Columns of Text Tables

  • Empty fields are treated as NULL. These are fields where there is nothing or just spaces between the separators.

  • Quoted empty strings are treated as empty strings.

Configuration

The default field separator is a comma (,). A different field separator can be specified within the SET TABLE SOURCE statement. For example, to change the field separator for the table mytable to a vertical bar, place the following in the SET TABLE SOURCE statement, for example:

 SET TABLE mytable SOURCE "myfile;fs=|"

Since HSQLDB treats CHAR and VARCHAR strings the same, the ability to assign a different separator to the latter is provided. When a different separator is assigned to a VARCHAR, it will terminate any CSV field of that type. For example, if the first field is CHAR, and the second field VARCHAR, and the separator fs has been defined as the pipe (|) and vs as the period (.) then the data in the CSV file for a row will look like:

 First field data|Second field data.Third field data

This facility in effect offers an extra, special separator which can be used in addition to the global separator. The following example shows how to change the default separator to the pipe (|), VARCHAR separator to the period (.) within a SET TABLE SOURCE statement:

 SET TABLE mytable SOURCE "myfile;fs=|;vs=."

HSQLDB also recognises the following special indicators for separators:

special indicators for separators

\semi

semicolon

\quote

single-quote

\space

space character

\apos

apostrophe

\colon

colon character

\n

newline - Used as an end anchor (like $ in regular expressions)

\r

carriage return

\t

tab

\\

backslash

\u####

a Unicode character specified in hexadecimal

Furthermore, HSQLDB provides csv file support with three additional boolean options: ignore_first, quoted and all_quoted. The ignore_first option (default false) tells HSQLDB to ignore the first line in a file. This option is used when the first line of the file contains column headings or other title information. The first line consists of the characters before the first end-of-line symbol (line feed, carriage return, etc). It is simply set aside and not processed. The all_quoted option (default false) tells the program that it should use quotes around all character fields when writing to the source file. The quoted option (default true) uses quotes only when necessary to distinguish a field that contains the separator character. It can be set to false to prevent the use of quoting altogether and treat quote characters as normal characters. All these options may be specified within the SET TABLE SOURCE statement:

 SET TABLE mytable SOURCE "myfile;ignore_first=true;all_quoted=true"

When the default options all_quoted= false and quoted=true are in force, fields that are written to a line of the csv file will be quoted only if they contain the separator or the quote character. The quote character inside the field is doubled when written out to the file. When all_quoted=false and quoted=false the quote character is not doubled. With this option, it is not possible to insert any string containing the separator into the table, as it would become impossible to distinguish from a separator. While reading an existing data source file, the program treats each individual field separately. It determines that a field is quoted only if the first character is the quote character. It interprets the rest of the field on this basis.

The setting, null_def, can be used to simplify importing text files containing empty fields. These fields are interpreted as null but the user may want an empty string or another default value instead of null. With null_def=true defined in the text source string, and a table column that is defined as DEFAULT <val> NOT NULL with a constant value for the default, the default value will be used instead of any empty or NULL field.

The character encoding for the source file is ASCII by default, which corresponds to the 8-bit ANSI character set. To support UNICODE or source files prepared with different encodings this can be changed to UTF-8 or any other encoding. The default is encoding=ASCII and the option encoding=UTF-8 or other supported encodings can be used. From version 2.3.4, the two-byte-per-character encodings of UTF-16 are also supported. The encoding=UTF-16BE is big-endian, while encoding=UTF-16LE is little-endian. The encoding=UTF-16 is big-endian by default. This encoding reads a special Unicode character called BOM if it is present at the beginning of an existing file and if this character indicates little-endian, the file is treated as such. Note HSQLDB does not write a BOM character to the files it creates from scratch.

Finally, HSQLDB provides the ability to read a text file as READ ONLY, by placing the keyword "DESC" at the end of the SET TABLE SOURCE statement:

 SET TABLE mytable SOURCE "myfile" DESC

Text table source files are cached in memory. The maximum number of rows of data that are in memory at any time is controlled by the cache_rows property. The default value for cache_rows is 1000 and can be changed by setting the default database property. The cache_size property sets the maximum amount of memory used for each text table. The default is 100 KB. The properties can be set for individual text tables. These properties do not control the maximum size of each text table, which can be much larger. An example is given below:

 SET TABLE mytable SOURCE "myfile;ignore_first=true;all_quoted=true;cache_rows=10000;cache_size=1000"

The properties used in earlier versions, namely the textdb.cache_scale and the textdb.cache_size_scale can still be used for backward compatibility, but the new properties are preferred.

Supported Properties

quoted = { true | false }

default is true. If false, treats double quotes as normal characters

all_quoted = { true | false }

default is false. If true, adds double quotes around all fields.

encoding = <encoding name>

character encoding for text and character fields, for example, encoding=UTF-8. UTF-16, UTF-16BE, UTF-16LE can also be used.

ignore_first = { true | false }

default is false. If true ignores the first line of the file

null_def = { true | false }

default is false. If true, replaces any null or empty fields in the text file rows with the column default value of the not-null column

cache_rows= <numeric value>

rows of the text file in the cache. Default is 1000 rows

cache_size = <numeric value>r

total size of the rows in the cache. Default is 100 KB.

cache_scale= <numeric value> and cache_size_scale = <numeric value>

deprecated properties, replaced by cached_rows and cache_size properties above.

fs = <unquoted character>

field separator

vs = <unquoted character>

varchar separator

qc = <unquoted character>

quote character

Disconnecting Text Tables

Text tables may be disconnected from their underlying data source, i.e. the text file.

You can explicitly disconnect a text table from its file by issuing the following statement:

 SET TABLE mytable SOURCE OFF

Subsequently, mytable will be empty and read-only. However, the data source description will be preserved, and the table can be re-connected to it with

 SET TABLE mytable SOURCE ON

When a database is opened, if the source file for an existing text table is missing, the table remains disconnected from its data source but the source description is preserved. This allows the missing source file to be added to the directory and the table re-connected to it with the above command.

Disconnecting text tables from their source has several uses. While disconnected, the text source can be edited outside HSQLDB, provided data integrity is respected. When large text sources are used, and several constraints or indexes need to be created on the table, it is possible to disconnect the source during the creation of constraints and indexes and reduce the time it takes to perform the operation.

Text File Usage

The following information applies to the usage of text tables.

Text File Issues

  • With file databases, text file locations are restricted to below the directory that contains the database, unless the textdb.allow_full_path property is set true as a Java system property. This feature is for security, otherwise an admin database user may be able to open random files. The specified text source path is interpreted differently according to this property. By default, the path is interpreted as a relative path to the directory path of database files, it therefore cannot contain the double dot notation for parent directory. This path is then appended by the engine to the directory path to form a full path.

    When the property is true, and the path starts with the forward slash or back slash, or the path contains a semicolon, the path is not appended to the directory path and is used as it is to open the file. In this usage the path is absolute.

  • By default, all-in-memory databases cannot use text tables. To enable this capability the textdb.allow_full_path property must be set true as a Java system property. The text file path is used as submitted and interpreted as an absolute path as described above, or a path relative to the Java process execute path. These text tables are always read-only.

  • Databases store in jars or as files on the classpath and opened with the res: protocol can reference read-only text files. These files are opened as resources. The file path is an absolute path beginning with a forward slash.

  • Blank lines are allowed anywhere in the text file, and are ignored.

  • It is possible to define a primary key, identity column, unique, foreign key and check constraints for text tables.

  • When a table source file is used with the ignore_first=true option, the first, ignored line is replaced with a blank line after a SHUTDOWN COMPACT, unless the SOURCE HEADER statement has been used.

  • An existing table source file may include CHARACTER fields that do not begin with the quote character but contain instances of the quote character. These fields are read as literal strings. Alternatively, if any field begins with the quote character, then it is interpreted as a quoted string that should end with the quote character and any instances of the quote character within the string is doubled. When any field containing the quote character or the separator is written out to the source file by the program, the field is enclosed in quote character and any instance of the quote character inside the field is doubled.

  • Inserts or updates of CHARACTER type field values are allowed with strings that contains the linefeed or the carriage return character. This feature is disabled when both quoted and all_quoted properties are false.

  • ALTER TABLE commands that add or drop columns or constraints (apart from check constraints) are not supported with text tables that are connected to a source. First use the SET TABLE <name> SOURCE OFF, make the changes, then turn the source ON.

  • Use the default setting (quoted=true) for selective quoting of fields. Those fields that need quoting are quoted, other not.

  • Use the quoted=false setting to avoid quoting of fields completely. With this setting any quote character is considered part of the text.

  • Use the all_quoted=true setting to force all fields to be quoted.

  • You can choose the quote character. The default is the double-quote character.

  • SHUTDOWN COMPACT results in a complete rewrite of text table sources that are open at the time. The settings for quoted and all_quoted are applied for the rewrite.

Text File Global Properties

The database engine uses a set of defaults for text table properties. Each table's data source may override these defaults. It is also possible to override the defaults globally, so they apply to all text tables. The statement SET DATABASE TEXT TABLE DEFAULTS <properties string> can be used to override the default global properties. An example is given below:

 SET DATABASE TEXT TABLE DEFAULTS 'all_quoted=true;encoding=UTF-8;cache_rows=10000;cache_size=2000'

List of supported global properties

  • qc="

  • fs=,

  • vs=,

  • quoted=true

  • all_quoted=false

  • ignore_first=false

  • null_def=false

  • encoding=ASCII

  • cache_rows=1000

  • cache_size=100

  • textdb.allow_full_path=false (a system property)

Transactions

Text tables fully support transactions. New or changed rows that have not been committed are not updated in the source file. Therefore, the source file always contains committed rows.

However, text tables are not as resilient to machine crashes as other types of tables. If the crash happens while the text source is being written to, the text source may contain only some of the changes made during a committed transaction. With other types of tables, additional mechanisms ensure the integrity of the data and this situation will not arise.


$Revision: 6787 $