# 4.3: String (atomic)

In the pre-digital era, data was preserved in books and printed data tables; data excerpted from the primary literature in this way is often classified as legacy data.  A string character is a digital representation of a letter, digit, or other symbol, and it cannot be used directly in mathematical operations: if you add the characters "1" and "1" you get "11" (eleven), not 2.  That is, the result is two ones next to each other.
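The difference between a string and a number can be seen in a short Python sketch. The variable names are illustrative only and do not come from the original text.

```python
# The + operator means different things for strings and for integers.
text_sum = "1" + "1"      # concatenation: two characters side by side
number_sum = 1 + 1        # arithmetic addition

print(text_sum)    # 11
print(number_sum)  # 2

# A string of digits must be converted before it can be used in arithmetic.
converted_sum = int("1") + int("1")
print(converted_sum)  # 2
```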

## ASCII (American Standard Code for Information Interchange)

One of the first things that needed to be encoded if people were to move from typewriters to word processors was the characters of the alphabet. In the early days of computing, memory was commonly organized in 8-bit bytes, and the American Standard Code for Information Interchange (ASCII) was developed as a code that fits in one byte and allows a computer to interact with a keyboard and store information.  Note that the first 32 ASCII codes (0 to 31) are unprintable control codes used to control devices. In the original 7-bit code, code 127 (delete) is also a control code, which leaves 95 printable characters such as the digits, punctuation marks, and letters of the alphabet.  We call this an atomic code level because each byte (8 bits) encodes a single character of the class string.  As we shall see, a word is also a string, but a string of multiple characters.

The original ASCII code was 7 bit (128 characters) and extended ASCII is 8 bit, so there are 256 characters.  The complete table can be seen at www.ascii-code.com and a few representations are shown in the table below.

| Dec | BIN | Symbol | Description |
|---|---|---|---|
| 0 | 00000000 | | Null Char |
| 10 | 00001010 | | Line Feed |
| 48 | 00110000 | 0 | Zero |
| 68 | 01000100 | D | Uppercase D |
| 100 | 01100100 | d | Lowercase d |
| 128 | 10000000 | € | Euro |
| 197 | 11000101 | Å | Angstrom symbol |
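The rows of the table above can be reproduced with Python's built-in `ord()`, `chr()`, and `format()` functions. This is a minimal sketch. The choice of the Windows-1252 code page (`"cp1252"`) for the codes above 127 is an assumption, since "extended ASCII" refers to several different 8-bit code pages, and the variable names are illustrative.

```python
# ord() gives the integer code of a character; chr() goes the other way.
for symbol in ["0", "D", "d"]:
    code = ord(symbol)
    bits = format(code, "08b")   # zero-padded 8-digit binary string
    print(code, bits, symbol)

line_feed = chr(10)              # the same character as "\n"

# Codes 128-255 depend on which 8-bit extension is assumed.
# Assumption: Windows-1252 ("cp1252"), which maps 128 to the euro sign.
euro_byte = "\u20ac".encode("cp1252")[0]      # "\u20ac" is the euro sign
angstrom_byte = "\u00c5".encode("cp1252")[0]  # "\u00c5" is A with ring above
print(euro_byte, angstrom_byte)
```

Under a different 8-bit code page the same byte values can map to different symbols, which is one reason a single universal character set became attractive.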

One of the shortcomings of ASCII is that an 8-bit code has only 256 possible representations, so there is a limit to the number of symbols (or commands like "tab") that can be encoded.

## Unicode & UTF-8

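The Unicode effort dates to 1988, and the Unicode Consortium was incorporated as a public benefit (non-profit) organization in 1991.  Unicode assigns each character a numeric code point, and UTF-8 (Universal Coded Character Set Transformation Format - 8 bit) is a variable-length encoding scheme for those code points.  UTF-8 is backward compatible with ASCII, and its variable length of between 1 and 4 eight-bit bytes allows many more characters to be encoded.  Two unrestricted bytes alone would give 256 x 256 = 65,536 combinations; UTF-8 reserves some bits in each byte to mark the length of the sequence, so not every combination is used, but four bytes are enough to cover every Unicode code point.  This allows for various symbols and sub/superscripts beyond the original ASCII.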
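The variable length of UTF-8 and its compatibility with ASCII can be checked with Python's `str.encode()` method. This is a small sketch; the four sample characters are chosen here for illustration and are not part of the original text.

```python
# One sample character for each possible UTF-8 length (1 to 4 bytes).
samples = [
    "D",           # U+0044, in the ASCII range
    "\u00c5",      # U+00C5, A with ring above
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the first 65,536 code points
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())

# For characters in the ASCII range, the UTF-8 bytes equal the ASCII bytes.
ascii_compatible = "D".encode("utf-8") == "D".encode("ascii")
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(ascii_compatible, byte_lengths)
```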

STIX fonts: the Scientific and Technical Information Exchange (STIX) project provides Unicode-based fonts for scientists and is integrated into Google Fonts.

In Python, a string is also a class that can hold multiple characters; what is described here is the representation of a single character of this class.  See the section on string (container) to learn about the methods associated with a string.
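As a brief illustration, Python has no separate character type: indexing a string returns another string of length one. The word used below is an arbitrary example.

```python
word = "atom"      # a string of four characters
first = word[0]    # indexing returns a one-character string

print(type(word).__name__, len(word))    # str 4
print(type(first).__name__, len(first))  # str 1
```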

References

Information Science for Chemists by Stuart Chalk, 2015 Cheminformatics OLCC

This page titled 4.3: String (atomic) is shared under a not declared license and was authored, remixed, and/or curated by Robert Belford.