The History of the TLG®

Searching for Terms of Happiness: How the TLG® Was Born

The Thesaurus Linguae Graecae® or "Treasury of the Greek Language" was conceived and initially funded by Marianne McDonald. In 1971 Marianne McDonald, then a graduate student in Classics at the University of California, Irvine, motivated by her dissertation research on "terms of happiness" in Euripides, proposed the creation of a computerized databank of Greek Literature. The concept was extraordinary since no one until then had considered the marriage of Classical scholarship with the rapidly emerging new technologies.

The creation of the TLG came at the end of a long tradition of scholarly efforts to systematically collect and preserve extant Greek literature. During the 16th century, Henri Estienne (Stephanus), a French scholar and printer, produced a sizable lexicon known as the Thesaurus Graecae Linguae. In the late 1800's the discovery of thousands of new papyri in Egypt and the rise of modern textual criticism motivated European scholars to undertake the creation of new lexica for Greek and Latin. The sheer volume of materials, however, and the cumbersome methods of gathering data manually was so overwhelming that the project was abandoned for many decades. Subsequent efforts, most notably one undertaken in the 1950s by Bruno Snell at the University of Hamburg, were also abandoned. Marianne McDonald, a student of Snell's, was well aware of the challenge. Being the daughter of Eugene McDonald, founder of the Zenith Corporation, she also understood the potential of technology and how it could transform her field. So, it was the advent of modern technology and a young classicist's search for "terms of happiness" that provided the catalyst and impetus for the creation of the electronic thesaurus, namely the TLG, fittingly named after its Renaissance predecessor.

First Planning Conference

TLG Planning conference participants The Project was officially established on October 30, 1972, the date of the first TLG Planning Conference. Classicists from Europe and the US were invited to attend and offer advice on a number of questions such as the chronological scope of the project, the selection of appropriate text editions, policies to be followed regarding digitization, and so on. The exact scope of the project had not been defined until then. It was described as "a lexicographical work which will collect, sort, and identify every single word extant in ancient Greek literary and non-literary documents."

The conference was attended by Winfried Buehler (University of Hamburg), Aubrey Diller (University of Indiana), Wilhelm Ehlers (TLL in Munich), G.M.A. Grube (Trinity College), Albert Henrichs (Harvard University), Charles Murgia (UC Berkeley), Brooks Otis (University of North Carolina, Chapel Hill), David W. Packard (UCLA), Jaan Puhvel (UCLA), Bruno Snell (University of Hamburg), Stephen Waite (Dartmouth College) and the Classics faculty at UC Irvine, most notably, Peter Colaclides, Theodore Brunner and Luci Berkowitz. UC Irvine's first Chancellor, Daniel G. Aldrich (seen in this photo sitting by the window) also attended the Conference.

Theodore F. Brunner, chair of Classics at the time, was named first TLG Director, a position he held for almost 25 years. Marianne McDonald, whose generous gift of $1-million made this ambitious undertaking possible, insisted on anonymity until her graduation from the University of California and attended the conference as a member of the Classics department. Her role and contributions became known later and were honored by the American Philological Association (now Society for Classical Studies) which awarded her the Distinguished Service Award in 1999.

The 1972 Conference established many of the principles that governed the project for years to come. For example, it was decided that the TLG would not be a lexicon, as the name 'thesaurus' might indicate, but a collection of texts ('databank') from Homer to 200 A.D. (an arbitrary cut-off date that was later modified), that it would not contain epigraphical and papyrological works, and that it would not include the critical apparatus of the texts.

David W. Packard and the Ibycus System

The challenge of this huge undertaking was originally met with the help of several classicists and technology experts, but primarily thanks to the efforts of David W. Packard who created the Ibycus system, namely the hardware and software originally used to correct the digitized texts and to search the TLG corpus. Packard also developed the character and formatting encoding convention for Greek, called Beta Code. Beta Code assigns an ASCII position to each of the twenty-four Greek letters. Diacritics are indicated by non-alphabetic characters following the accented vowel. Beta Code remains to this day the most practical way to encode polytonic Greek data. TLG digitization has always been done via double-keyboarding, in Korea, the Phillipines, and now in China. Even today, the texts are shipped to China where typists, ignorant of Greek or English, enter the Greek characters in Beta Code which is converted into Unicode or other types of Greek fonts.

The Canon of Greek Authors and Works

TLG Canon volume

In addition to digitizing texts, the TLG developed the Canon of Greek Authors and Works, originally a "registry" of all works included or about to be included in the corpus. Over time the Canon developed into an indispensable tool for the study of Greek literature. Luci Berkowitz and Karl Squitier produced the bulk of these bibliographies until the early 1990s, when they retired from the University. Since 2001 the Canon is is searchable on-line. It has been substantially expanded and updated (since its print publication) to include new editions and a large number of Byzantine and post-Byzantine works.The

The Early Days: Greek Texts on CD ROMs

Digital texts were distributed to interested scholars on magnetic tapes as early as 1976. In 1985 Packard and William Johnson co-founded a company and developed the Ibycus Scholarly Personal Computer (PSC), the first computer to allow the editing, search, and retrieval of classical texts in a fully integrated desktop package. Designed before the introduction of the Apple Macintosh, this personal computer combined multi-lingual word processing with a high-speed, CD ROM-based delivery system for large text data banks such as the TLG Greek corpus. Packard (assisted by Johnson and Wilkins Poe) also developed the indices and other subsystems which permitted rapid and discerning selection and retrieval of TLG texts on CD ROM.David Packard With their help, the TLG took the first step towards a wider distribution of its digital texts, namely the production of a CD ROM (TLG A), containing 27 million words. TLG A was the first compact disk ever produced and it did not contain music. This compact disk, which was meant to be read by Ibycus, did not contain any search software. As microcomputers became more popular and readily available in the late 1980s, a number of search programs were developed by independent software developers to search the TLG disk on Macs or PCs. Nevertheless, Ibycus PSC had an important influence on subsequent developments in the field of text data bank processing, and was formative in motivating scholars of antiquity to utilize computer resources well ahead of their peers in adjacent disciplines.

In 1988 the TLG® released its second CD ROM (TLG C) with 42 million words. At the same time, the TLG Advisory Board (chaired by Ihor Sevceno) approved the expansion of the project's scope to include Byzantine historiography, lexicography, and the scholia.

In 1993, in the middle of a budget crisis, the University of California offered its faculty incentives for early retirement. Both Theodore Brunner and Luci Berkowitz opted to retire in 1994 but were recalled at 50% time, Luci Berkowitz to teach courses in the Classics Department and Theodore Brunner to recruit his successor. This was a critical time for the project. All options were considered, including the possibility of downsizing the TLG to a small unit to license CD ROMs. Fortunately, this option did not materialize. Three years and three recruitment later, Maria C. Pantelia joined the University of California in 1996 as the second TLG Director.

The TLG into the 21st Century

Maria Pantelia was immediately faced with a number of critical choices. Due to Ted Brunner's early retirement, the project had gone through a period of reduced activity. By 1996, technology was moving rapidly and web technologies were making huge strides. The TLG was hampered by once revolutionary but now outdated technology. The project had no programming expertise, no detailed documentation on how to produce its CD ROMs, and the Ibycus system could no longer handle the size of the corpus. What was worse, the Packard Humanities Institute (PHI) no longer had the hardware and software to produce another CD ROM. Maria PanteliaPantelia realized that a number of areas had to be addressed in an accelerated fashion. The TLG® needed an infrastructure based on contemporary, widely used technologies, and qualified research and programming staff capable of performing the complex interdisciplinary tasks dictated by the nature of the Project. These included the development of an in-house system to replace Ibycus; software to produce CD ROMs and make the TLG data available to the scholarly community; and an on-line search and retrieval system to move TLG dissemination to the Internet.

Pantelia's first priority was to form a team of IT enthusiasts. In early January 1999 she recruited Nick Nicholas (Ph.D. in Linguistics, University of Melbourne), and Nishad Prakash (B.S. in Classics and Computer Science, Ohio Wesleyan University). Nicholas and Prakash brought with them much enthusiasm and rare interdisciplinary expertise, namely knowledge of Greek and extensive programming experience. Pantelia, Nicholas and Prakash (together with researchers Astrid Steiner-Weber, Antonia Giannouli and Rosa Parent), worked around the clock for over 6 months to update the TLG data and infrastructure. Nicholas undertook the project of developing software to check the formatting and spelling (V&C) of the digital texts. Prakash focused his attention on transferring the Canon data to a relational database and on developing a web-based system for cataloging and editing the Canon data. By the summer of 1999, the TLG had a new text correction system and a database designed to store the Canon data.

Disconnecting the Ibycus mainframe

Migration out of Ibycus was a Herculean effort that lasted several months. Thousands of texts had to be downloaded from Ibycus using a 2200 baud modem. For two months, project staff worked 8-hour shifts downloading texts, one at a time. Large files were particularly difficult to download because Ibycus froze whoever a text exceeded its memory capabilities. Some texts were corrupted and had to be manually reconstructed. The Canon file containing thousands of bibliographical records was too large to download and had to be broken into pieces. When it was finally extracted, all formatting was lost and had to be re-entered manually.

In September 1999, the project said farewell to Ibycus. The HP-1000 was disconnected and replaced by the new in-house system.


TLG Disk

With the necessary infrastructure in place, Pantelia turned her attention to producing a new CD ROM. The TLG had not issued a disk since 1992. A new disk was greatly anticipated. Work on CD ROM E began in the early summer of 1999. In the absence of proper documentation, Nick Nicholas was able to decode the format developed by the Packard Humanities Institute and reverse engineer the software to produce a disk that would be compatible with earlier CD ROMS. Once the basic programming was in place, the project collaborated with software developers to ensure that existing search engines would work well with the new disk.

TLG E was released in February 2000. TLG E was a landmark in the history of the Project, the first TLG CD ROM developed entirely in-house by M. Pantelia and her team.

A great deal had been accomplished in a year's time. The most important development was that the project had a better understanding of its data and therefore the ability to move forward confidently without relying on outside entities.

The On-line TLG

In April of 2001 the TLG released its first on-line version, providing web access to its bibliographical and textual resources. The development of the on-line TLG took approximately one year and involved the creation of a search engine, user interface, and site documentation. In addition to the programming tasks, massive correction of the texts was done during that time. The new Correction and Verification system (V&C) allowed the TLG to correct errors more efficiently. As a result the corpus grew dramatically. An Abridged version of the corpus was established containing a number of works from authors that have traditionally been used in college-level instruction of Greek (Homer and Hesiod, the dramatists, Plato and Aristotle, Xenophon, the orators, etc.). The Abridged version was meant to make the TLG accessible to high-schools and small undergraduate programs that might not be able to afford a subscription to the Full corpus. The Abridged version has continued to grow. As texts are replaced by new editions, earlier editions are added to the Abridged version.

The full Canon of Greek authors and works was also made available to the public. In 2002, thanks to an equipment grant by Sun Microsystems, the on-line TLG was installed on an UltraSPARC-II server, which was appropriately dubbed "Stephanus." This server has since been replaced by several generations of servers that can sustain high usage and provide fast access to the TLG data.

The Extension of the Canon into the Byzantine Period

In the late 1980s the TLG's scope was extended to include a selection of late texts, mostly Byzantine scholia, historiography and lexicography. This arbitrary division made no sense ten years later. Maria Pantelia saw the importance of expanding the corpus into the Byzantine and post-byzantine period. She decided that the TLG should be a corpus of all Greek literature.

The expansion of the corpus began as soon as the new V&C system was in place. Maria Pantelia assumed the responsibility of editing the Canon since 1997. 2022 OUP Publication cover In 2022 all bibliographical information included in the TLG was published by the University of California Press

The pace of digitization also increased. Today the on-line version contains more than twice as many texts compared to TLG E. Most extant authors and works up to the 18th century have been included in the TLG and work is underway to fill in any gaps left and to expand the corpus into the modern era. The conception of the Canon has been both refined and expanded. In addition to assigning a four digit number to each author and a 3-digit number to each work, the Canon database now includes a complex citation system that describes the structure of each text included in the corpus.

Collaboration with Unicode

From 2002-2004 the TLG undertook the task of completing the Polytonic Greek set for the Unicode Standard. Maria Pantelia and Richard Peevers, worked with the Unicode Technical Committee (UTC) to ensure that all characters extant in Greek texts were properly represented in the Unicode Standard. The TLG submitted 14 proposals totaling more than 200 characters. There were all approved by both UTC and ISO. ( TLG proposals to unicode) As of Unicode version 5.1, all characters proposed by the TLG are part of the Unicode Standard. A Quick Reference Guide was also put together so that font developers can find quickly which Unicode code point corresponds to each character. Thanks to these efforts, scholars of ancient and medieval Greek now have a complete set of accented and technical characters permanently encoded into the Unicode conventions. This guarantees that Polytonic Greek can be uniformly read across all computer platforms and browsers.

The Lemmatization Project

Because of the highly inflected nature of Greek, and the linguistic and orthographic heterogeneity of the corpus, searching the TLG for individual word forms can be tedious. In early 2003, the TLG undertook the project to lemmatize its corpus. The first effort to employ computers in the morphological analysis of Greek was undertaken by David W. Packard in the early 1970s. The TLG lemmatizer is an extensive system that at present can recognize automatically 98.3% of all wordforms in the TLG corpus. To reach this goal, the TLG digitized and extracted a large number of headwords from dictionaries (such as LSJ, Lampe, Bauer, Trapp, Kriaras). The lemmatized search engine was made available in beta form in 2006 and is now a regular feature of the TLG search engine.

The TLG and Greek Lexicography

As part of the lemmatization project, the TLG became active in developing online lexica. The first major effort was the digitization of the Liddell-Scott-Jones (LSJ) dictionary, the premier lexicon for ancient Greek. Cunliffe's Lexicon of Homeric Greek and Powell's Lexicon to Herodotus followed. In 2013 the TLG and the Austrian Academy of Sciences joined forces to create a digital version of the Lexikon zur byzantinischen Gräzität (LBG). In 2020 the project released a searchable version of S. Koumanoudes' Συναγωγή νέων λέξεων.

The TLG Search Engine

The TLG released a new and updated version of its search engine in February 2015. The new search engine incorporates all the features of the original system and more. The search mechanism was designed to be intuitive and offer a streamlined interface. In addition to substantially expanded lookups, it offers intertextual phrase matching, statistical analysis and quick access to the TLG dictionaries.

Forty-eight years after its establishment, the Thesaurus of the Greek Language is a reality. The TLG® Digital Library contains virtually all Greek texts surviving from the period between Homer (8 c. B.C.) and the fall of Byzantium in A.D. 1453 and a large number of texts up to the 20th century.. The TLG corpus is available in more than 2,000 universities and research centers around the world and used by thousands of researchers, educators, and students from a wide range of disciplines such as classics, archaeology, history, art, history, philosophy, linguistics, and theology/religious studies.