Some Thoughts on Archiving Manuscripts of Ancient Assam

Prof. Smriti Kumar Sinha
Dept. of Computer Science and Engineering
Tezpur University


In Assam, we have a huge corpus of medieval manuscripts written mainly on Sachipat. Medieval manuscripts are important sources of tradition, history and culture. We need to preserve and organize them and make them available to the world. These cultural objects exist in different media, such as parchment, vellum, palm leaves, Sachipat and paper, and are encoded in different languages and scripts. Many technologies have evolved over the ages to archive these cultural objects. Information and Communication Technology (ICT) is claimed to be the most promising modern technology for such archival activities. ICT is, however, not a panacea for all problems in the archival process: it has tremendous potential as a candidate solution for archiving, but at the same time it has its own limitations.

Archiving is an evolving process. Along the evolutionary track of the process, many technologies have been used. At each point of time, those were the best technologies available. In this second decade of the twenty-first century ICT is, without any doubt, the best technology available. But who knows, a better technology may come up at the end of the current century or in the next. When people used Sachipat, or later paper, as the medium of writing, could they even imagine that something called digital media would come in a subsequent century? Imagination is also bounded by a window of capability. Interestingly enough, the window keeps sliding, which gives imagination its virtually unbounded nature. In the present article, I would like to discuss mainly the advantages and the limitations of ICT as an archival technology.

Now the fundamental question is: what objects do we archive? We archive expressions in the form of symbols, icons, artifacts, records, documents etc. The next question is: why do we archive? We archive for different reasons. We can access the objects whenever required. We can understand the evolution of certain aspects of civilization by analyzing the archived objects, and so on. By studying the expressions we can reconstruct the thought process involved, the inputs to that thought process and ultimately the social context prevailing at that time as a whole.

Broadly the objectives of archiving are preservation, organization and sharing within and across generations. This set of objectives will be the guiding framework for our discussion.

Archiving in Biological System

Before going on to archive cultural objects, which are creations of the human mind, let us look at Nature, especially the biological system. God, the creator, has meticulous plans for the preservation, organization and sharing of His creations within and across generations. The plan was unknown to us for centuries. Only very recently was it revealed, with the advent of genetic research. God archived His creations in chromosomes. A chromosome is an organized structure of DNA, protein, and RNA found in cells. Chromosomal DNA encodes most or all of an organism’s genetic information. A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes, such as ribosomal RNA (rRNA), transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. The process of gene expression is used by all known life forms (ref. wiki).

We may not know life in totality, but today we know the language of life. The infinite sentences of life are composed of symbols from three finite sets of symbols:

DNA = {g, a, c, t} Bases: guanine [g], adenine [a], cytosine [c], thymine [t]

RNA = {g, a, c, u} Bases: guanine [g], adenine [a], cytosine [c], uracil [u]

PROTEIN = {g, a, v, l, i, m, f, w, p, s, t, c, y, n, q, d, e, k, r, h}

Amino acids: glycine (g), alanine (a), valine (v), leucine (l), isoleucine (i), methionine (m), phenylalanine (f), tryptophan (w), proline (p), serine (s), threonine (t), cysteine (c), tyrosine (y), asparagine (n), glutamine (q), aspartic acid (d), glutamic acid (e), lysine (k), arginine (r), histidine (h)

The Creator of life has composed beautiful compositions using this language of life. He preserved His creations in this language in chromosomes and genes, which are transmitted across generations and which are triggered, activated and stopped in time, much like present-day software: at a distance of time and space, software written by a programmer is triggered, executed and terminated. The script of life is encoded in the DNA, RNA and PROTEIN sequences. Biological manuscripts for life processes have been archived successfully for millions of years. God’s archive is meticulously planned; its contents are durable for millions of years and sharable across generations.

A Mistake in God’s Planning

God created different life forms, and the necessary hardware and software were successfully encapsulated in each of them. But, by mistake, a tremendous ability to think was encapsulated only in human beings. Over time the creative mind of man started creating hitherto unknown non-biological objects or expressions: artifacts, songs, dances, poems, epics and so on. Since these were outside God’s plan, He has no archival plan for these cultural objects. We have to plan for archiving our own creations.

Perhaps it is not a mistake. How can God commit a mistake? Who am I, and how dare I find mistakes in God’s plan? It is perhaps the intention of God that we should find our own mechanism for archiving our own creations.

Cultural Chromosome

In different branches of knowledge, including manuscriptology, we are trying to preserve our knowledge so that it can be transmitted to the generations to come, as part of the task of carrying our civilization forward. Eventually we have to find a cultural chromosome analogous to the biological chromosome. The structure of the cultural chromosome will organize our cultural traits for easy access and preserve the content for millions of years. Through mutation and crossover operations on cultural genes, ever newer traits adapted to new social contexts will evolve. We have to define the language of culture and its alphabet, so that we can encode and decode our cultural creations, preserve our cultural objects for millions of years and transmit them to the generations to come. The structure and composition of the cultural chromosome are not clear today, but we may hope that the collective efforts of science and culture will develop the cultural chromosome in the near future.

Memory is an Archival Medium

Human memory is the most primitive and primary medium of archiving, and neurons are its basic units. Memorization is the process of archiving. The content of memory is transmitted from one generation to the next as shruti, or oral tradition. When no other medium was available, mankind archived the acquired information, knowledge and wisdom in memory and institutionalized the mechanism of transmission across generations in different forms of tradition. When all other archival media are destroyed, the human race will perhaps bank on memory to recreate the knowledge corpus. It has happened again and again in history.

The ancient Library of Alexandria was one of the largest and most significant libraries of the ancient world. It flourished under the patronage of the Ptolemaic dynasty and functioned as a major centre of scholarship from its construction in the 3rd century BC until the Roman conquest of Egypt in 30 BC. The library was conceived and opened either during the reign of Ptolemy I Soter (323–283 BC) or during the reign of his son Ptolemy II (283–246 BC). The library is famous for having been burned, resulting in the loss of many scrolls and books, and has become a symbol of knowledge and culture destroyed (ref: wiki).

Nalanda was one of the world’s first residential universities, i.e., it had dormitories for students. It was also one of the most famous universities. In its heyday, it accommodated over 10,000 students and 2,000 teachers. Nalanda was ransacked and destroyed by an army under Bakhtiyar Khilji in 1193. The great library of Nalanda University was so vast that it was reported to have burned for three months after the invaders set fire to it, ransacked and destroyed the monasteries, and drove the monks from the site. (ref: wiki)

Around 1732 AD, during the reign of Garib Niwaz or King Pamheiba of Manipur, all the ancient Manipuri manuscripts were burned following the advice of vaishnavite guru Shantadas Goswami.

All such historical events point to the fact that if the backbone of a civilization is to be broken, then the cultural continuum of the civilization has to be stopped, and that can be done by destroying the archives, the repository of knowledge. There is a very famous film, Fahrenheit 451 (1966), directed by François Truffaut and based on the 1953 novel of the same name by Ray Bradbury. According to the story, 451 degrees Fahrenheit is the temperature at which book paper catches fire. The film is on the theme of burning books. A totalitarian government employs a force known as Firemen to seek out and destroy all books. Eventually all books are burned. But there is a hidden sect of people, called book people, each of whom has memorized a single book to keep it alive. Thus the film shows human memory and oral transmission as the last resorts of mankind for archiving great books.

There are many limitations of human memory as an archive. It is not known properly how data is stored in the neurons. We can’t read or access the content of memory unless the owner expresses in different forms. So the access to the content in memory is indirect. Moreover human memory is not persistent.

Other Media

With the advancement of civilization we got persistent media for archiving, like stones, clay tablets, papyrus, vellum, Sachipat, paper etc. I do not wish to discuss these media here, as hundreds of articles on them are available. The advantage of such a medium is that the content is persistent and directly accessible. These media have varied degrees of durability and portability. But none of them is space optimal, so they are not suitable for large volumes of content. Moreover, content stored in such media is not efficiently searchable.

Encoding Schemes for Expressions

In ancient times, cave dwellers expressed their creative talents in paintings on the walls of caves. Even today, paintings and images are vibrant expressions of culture. Abstract symbols and logos followed suit. In ancient Egypt, a logographic script, called hieroglyphs, was developed. This ancient script consists of pictures as characters. The system was not efficient, as the number of pictures was large.

The Phoenicians came up with an efficient writing system. In ancient times the Phoenicians were highly civilized. They travelled to Egypt, Greece, the Indus Valley and many other parts of the world in ships or boats made from cedar trees. Ancient Phoenicia is present-day Lebanon, and we find a cedar tree on the national flag of Lebanon. Around 1200 BC the Phoenicians provided an alphabet with a limited set of symbolic characters. These characters represented sounds that can be produced by human beings, so with the Phoenician alphabet, in principle, any human language could be encoded. As a result, it became the default writing system. It is said that initially the Phoenician alphabet had only consonants, and that the Greeks added vowels later on. Most of the modern alphabets in use today are directly or indirectly derived from the Phoenician alphabet.

Digital Media

So far we have discussed only some philosophical and historical thoughts. Let us now come down to ground reality and discuss the most promising medium for archival activity using ICT. Digital media is such a medium, where we can store hypermedia information in a uniform way. Hypermedia means the union of text, image, audio and video information.

hypermedia = text + image + audio + video

Every component of hypermedia is uniformly represented by two symbols, 0 and 1. In digital media, each 0 or 1 is a bit, a contraction of BInary digiT. 0 and 1 are basically two states that we have to represent physically. In magnetic storage devices, like hard disks and tapes, 0 and 1 are represented by the direction of magnetization of tiny regions of the medium. In optical media, the two states are represented using the principle of reflection of light. We know that a ray normally incident on a plane surface is reflected back in the direction opposite to the incident ray. On the smooth surface of an optical medium, like a CD, pits are created along a track while the CD is burned with a CD writer. The remaining flat areas are called lands. In a simplified picture, pits and lands represent 1 and 0 respectively: when a laser ray is normally incident on a land, it is reflected back with nearly the same intensity, representing 0; when it falls on a pit, it is scattered, so the intensity of the reflected component is much lower, representing 1. (In the actual CD encoding, it is the transitions between pit and land that are read as 1.) In electronic media, 0 and 1 are represented by different voltage levels, say 0 V for 0 and 5 V for 1; flash memory devices such as pen drives store the two states as the presence or absence of trapped electric charge. 8 bits constitute a byte, 1024 bytes a kilobyte (KB), 1024 KB a megabyte (MB) and so on.
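The relation between bits, bytes and the larger units can be made concrete with a few lines of Python. This is only an illustrative sketch; the function names to_bits and human_size are my own and not part of any standard.

```python
def to_bits(data):
    """Return the bytes as a space-separated string of 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in data)


def human_size(n_bytes):
    """Express a byte count using the 1024-based units described above."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit in which the value is below 1024.
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(to_bits(b"A"))            # 01000001
print(human_size(1024 * 1024))  # 1.0 MB
```

Whatever the physical medium, it is this same sequence of 0s and 1s that is finally stored.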


Text in a digital medium is encoded using an encoding scheme. The American Standard Code for Information Interchange (ASCII) is a character-encoding scheme originally based on the English alphabet. Original ASCII encodes 128 specified characters (the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, some control codes and a blank space) as 7-bit binary integers. The 8-bit variants, known as extended ASCII, encode 256 characters. ASCII cannot represent the alphabets of all the languages of the world. To accommodate all writing systems, the Unicode scheme was developed. Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems. At the time of writing, Unicode contains a repertoire of more than 110,000 characters covering about 100 scripts. Unicode can be implemented by different character encodings, the most commonly used being UTF-8 and UTF-16. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII: it uses one byte for any ASCII character, which has the same code value in both UTF-8 and ASCII, and up to four bytes for other characters. UTF-16 is also variable-width: it uses one 16-bit unit (two bytes) for the most commonly used characters and two 16-bit units (four bytes) for the rest.
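The variable widths of UTF-8 and UTF-16 can be checked with a short Python sketch. The three sample characters are my own choice for illustration: an ASCII letter, the letter U+0985 used in the Assamese script, and a character outside the 16-bit range.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "Latin capital letter A": "A",
    "Assamese/Bengali letter A": "\u0985",
    "Linear B syllable (beyond the 16-bit range)": "\U00010000",
}


def encoded_lengths(ch):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for name, ch in samples.items():
    u8, u16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X} {name}: UTF-8 {u8} byte(s), UTF-16 {u16} byte(s)")
```

The ASCII letter takes one byte in UTF-8, the Assamese letter three, and the last character four; in UTF-16 the first two take two bytes each and the last takes four.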


A digital image is a numeric representation (normally binary) of a two-dimensional image. It may be of vector or raster type. The term digital image usually refers to raster images, also called bitmap images. Generally speaking, the file size of a raster image grows with the number of pixels in the image and with its colour depth, that is, the number of bits per pixel. There are different image formats available, each with its own characteristics. The important formats are:

  • JPEG: Nearly every digital camera can save images in the JPEG format, which supports 8-bit gray scale images and 24-bit colour images (8 bits each for red, green, and blue). JPEG applies lossy compression to images, which can result in a significant reduction of the file size.
  • JPEG 2000: This format improves on JPEG in quality and compression ratio, but requires more computational power to process. JPEG 2000 also adds features that are missing in JPEG. It is not nearly as common as JPEG, but it is currently used in professional movie editing, archival and distribution (some digital cinemas, for example, use JPEG 2000 for individual movie frames).
  • TIFF (Tagged Image File Format): A flexible format that normally saves 8 bits or 16 bits per colour (red, green, blue), for 24-bit and 48-bit totals respectively, usually using either the TIFF or TIF filename extension.
  • GIF (Graphics Interchange Format): GIF is limited to an 8-bit palette, or 256 colours. This makes the format suitable for storing graphics with relatively few colours, such as simple diagrams, shapes, logos and cartoon-style images.
  • BMP (Windows bitmap): The BMP file format handles graphics files within the Microsoft Windows OS. Typically, BMP files are uncompressed and hence large; the advantage is their simplicity and wide acceptance in Windows programs.
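How pixel count and colour depth determine the size of an uncompressed raster image can be worked out in a few lines. This is a rough sketch only; the page dimensions below are a hypothetical example (roughly an A4 sheet scanned at 300 dots per inch), and real files add headers and may be compressed.

```python
def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data, rounded up to whole bytes."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


# Hypothetical scan size; not taken from any actual digitization project.
width, height = 2480, 3508

for label, depth in [("8-bit gray scale", 8), ("24-bit colour", 24), ("48-bit colour", 48)]:
    size = uncompressed_size_bytes(width, height, depth)
    print(f"{label}: {size} bytes ({size / (1024 * 1024):.1f} MB)")
```

At 24 bits per pixel a single such page needs about 25 MB before compression, which is why the choice of format matters for a large corpus.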

UNESCO established the Memory of the World program in 1992. One of the main goals of the digitization project was to minimize the need for handling original manuscripts. UNESCO recommended using compressed JPEG image files for Memory of the World projects due to JPEG’s excellent ratio of quality-to-file size, the ability to upload and view JPEGs at a reasonable speed, and JPEG’s compatibility across computer platforms.

Normally, when we talk about digital archiving of manuscripts, our focus remains on scanning the ancient manuscripts into digital images without damaging the fragile originals. But the task doesn’t end there; rather, capturing digital images of the ancient manuscripts is the beginning of the process. A Sachipat manuscript contains mainly text and some associated images. We can think of converting the text content into Unicode text by using optical character recognition (OCR) devices and algorithms. OCR is a common method of digitizing printed texts so that they can be electronically searched, stored more compactly, displayed on-line, and used in machine processes. Our problem is that the text contents of Sachipat manuscripts are handwritten. Research is going on at Tezpur University and IIT Guwahati on the recognition of handwritten Assamese documents. Why not use the output of this research for image-to-text conversion of Sachipat manuscripts? Simply storing digital images of ancient manuscripts is not archiving.


To harness the potential of digital media, apart from digital images and image-to-text conversion, we can think of other aspects also. One important aspect is metadata. The descriptive information attached to a digital archive is commonly referred to as metadata. Information such as where the manuscript was found, how old it is and who its owner is constitutes the metadata, which is to be encapsulated and stored along with the digital version of the manuscript. A successful metadata system should satisfy the needs of cataloguers, users, technical experts and administrators. Thus it should be flexible, extensible and forward looking, allowing easy searching and browsing by the user at different points of access to the collection.

At present the best technology for organizing the metadata of archived manuscripts is XML. XML stands for eXtensible Markup Language. It is a descriptive markup language, not a programming language. A markup language encodes data for processing, whereas a programming language encodes the logic of processing. Documents in a descriptive markup language are simple text, hence readable in any standard text editor, such as Notepad. That XML is extensible means it has no predefined tags, only a set of rules to follow. An archivist can define meaningful tags that best reflect the semantics of the archival data. The grammar for the use of these user-defined tags can also be specified by the archivist, if required, using an XML DTD or XML Schema. XML technology seems to offer the best possible solution for the metadata system of choice.
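A metadata record with archivist-defined tags might look like the following minimal sketch, built with Python's standard xml.etree.ElementTree module. The tag names (manuscript, title, material, found_at, owner) and all the values are hypothetical placeholders of my own, not an established schema or a real catalogue entry.

```python
import xml.etree.ElementTree as ET


def build_record(ms_id, title, material, found_at, owner):
    """Build one metadata record using hypothetical, archivist-defined tags."""
    record = ET.Element("manuscript", attrib={"id": ms_id})
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "material").text = material
    ET.SubElement(record, "found_at").text = found_at
    ET.SubElement(record, "owner").text = owner
    return record


# Placeholder values only; they do not describe an actual manuscript.
record = build_record("ms-0001", "Sample manuscript", "Sachipat",
                      "Placeholder location", "Placeholder owner")

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# The same plain text can be parsed back and queried by tag name.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("material"))
```

Because the record is plain text with self-describing tags, it remains readable even without the program that produced it.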


XML is good for semantic encoding of digital data. It is not designed for presentation. For attractive presentation on the Web we have another markup language, HTML, which stands for HyperText Markup Language. XML is for semantic encoding, whereas HTML is for encoding presentation. The tags in HTML are not extensible. That doesn’t mean that HTML is being replaced by XML. They complement each other and play in concert on the Web, like a yugalbandi (duet). XML data can easily be transformed into HTML pages. XHTML is HTML that satisfies the rules of XML. With HTML we can present all components of hypermedia to the whole world through Web technology.
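The transformation from semantic XML to presentational HTML can be sketched as follows. In practice a dedicated transformation language such as XSLT is often used for this; the sketch below instead uses only the Python standard library, and the record and its tag names are hypothetical placeholders.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical metadata record; tags and values are placeholders.
XML_RECORD = """<manuscript id="ms-0001">
  <title>Sample manuscript (placeholder)</title>
  <material>Sachipat</material>
  <owner>Placeholder owner</owner>
</manuscript>"""


def record_to_html(xml_text):
    """Render each child element of the record as one row of an HTML table."""
    record = ET.fromstring(xml_text)
    rows = []
    for field in record:
        label = html.escape(field.tag.capitalize())
        value = html.escape(field.text or "")
        rows.append(f"  <tr><th>{label}</th><td>{value}</td></tr>")
    title = html.escape(record.findtext("title", default="Untitled"))
    return "\n".join([f"<h1>{title}</h1>", "<table>", *rows, "</table>"])


print(record_to_html(XML_RECORD))
```

The XML record says what each piece of data means; the generated HTML says only how it should be shown.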


In the digital age, the World Wide Web (WWW) is obviously the best instrument for sharing information and knowledge world-wide. Because of the WWW, information today can flow across all boundaries: geographical, economic and political. It thereby tends to establish an equilibrium state of knowledge and information. The rich pool of human knowledge and information is now instantly shared by anybody, irrespective of caste, creed or citizenship of an underdeveloped, developing or developed country. Moreover, people can enrich the pool by publishing their domain-specific knowledge on the Web from any corner of the world. Digital archives connected to the Web can be visited by anybody from anywhere in the world. The WWW was originated by Tim Berners-Lee in 1989 at CERN. The global enabling infrastructure behind the success of the WWW is the Internet. The WWW is so popular that people in general think the WWW and the Internet are the same. In reality it is not so. The WWW is a service, like other services, provided over the Internet.

Limitations of Digital Archiving

Along with so many advantages, digital media have certain limitations too, and for digital archiving these limitations are important. Digital objects are not directly readable. Nobody can read the objects archived in digital media unless specific software and devices express them in a human-readable form. We are back in the same situation as when human memory was used as the medium of archiving. Device drivers play a vital role in this regard. Device drivers are the small pieces of software associated with each device. Only through the proper device driver can other devices or application software talk to a particular device, such as a storage device. If the device driver is obsolete or damaged, then the whole content is effectively lost, because it can no longer be read. The formats of digital images and other components of hypermedia are changing fast. Backward compatibility is taken care of only to a certain extent. The media handlers are format dependent. If, at a particular point of time, an old format is not recognized by any media handler of its type, say images, then the content stored in that format is no longer accessible. Even if we do access it, it will be an incomprehensible string of zeros and ones. Digital storage media are also vulnerable to external forces. For example, if magnetic disks are exposed to a strong magnetic field, the content may be changed or damaged. That means the durability of digital content depends on the reliability of the devices.
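Two of the points above can be illustrated with a small Python sketch: a media handler typically recognizes a format from the first few bytes of a file, and silent changes to stored content can be detected by comparing checksums recorded at the time of archiving. This is a sketch under my own assumptions, not a description of any particular archival system; the file names and byte contents are invented for the example.

```python
import hashlib

# Leading signature bytes of a few common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"BM", "BMP"),
]


def identify_format(data):
    """Return a format name if the leading bytes are recognized, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def build_manifest(objects):
    """Map each object name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in objects.items()}


def find_changed(objects, manifest):
    """List names whose content is missing or no longer matches the manifest."""
    return sorted(
        name for name, digest in manifest.items()
        if name not in objects or hashlib.sha256(objects[name]).hexdigest() != digest
    )


# Invented in-memory "archive"; real use would read files from storage.
archive = {
    "folio-001.gif": b"GIF89a" + b"\x00" * 16,
    "folio-002.xyz": b"\x01\x02\x03\x04",  # a format no handler here knows
}
manifest = build_manifest(archive)

print(identify_format(archive["folio-001.gif"]))  # GIF
print(identify_format(archive["folio-002.xyz"]))  # None

# Simulate one corrupted byte in the stored copy.
damaged = dict(archive)
damaged["folio-001.gif"] = b"GIF89a" + b"\x00" * 15 + b"\x01"
print(find_changed(damaged, manifest))  # ['folio-001.gif']
```

The second file is intact, yet without a handler that knows its format it is only a string of bytes; the first is recognized, but a single changed byte is enough to fail the check.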


Before winding up, I would like to say that, before delving into digital archiving, a thorough SWOT analysis of digital media in the context of archiving should be done. Through the analysis, the strengths, weaknesses, opportunities and threats of digital media will come to hand. Meticulous planning is required a priori. Security, safety and monitoring plans are of foremost importance. A backup plan for the archived cultural objects should be in place from the very first day. Versions of content handlers change fast, so a version handling plan is required. As discussed above, formats for hypermedia content are also changing fast, and some formats become obsolete. Each format has its own advantages and limitations. Archived digital content has to be migrated accordingly, so a content migration plan should be in place. Along with the digital cultural objects, the necessary system and application software, including device drivers and content handlers, needs to be archived to avoid unforeseen situations in future. Lastly, I would like to conclude with the remark that archiving is an evolving process. At this point of time digital media may be the best. In the distant future, some better candidate may come up, and that candidate technology may or may not be digital.



