# 1: Introduction to Data


## Data/Datum

Data is the plural form of the Latin word "datum", which means "something given" or, loosely, a "fact". In modern English, data is commonly treated as a mass noun, so the same form is used for a single piece of data and for a set of data (you say "give me the titration data", not "give me the titration datas"). Although the word data refers to information, its meaning differs between empirical science and computer science, and the difference is worth a closer look. To the empirical scientist, data is used to understand natural (observable) phenomena. To the computer scientist, data is the digital representation of information, which is the basis for computational tasks, algorithm development, and information processing.

• Data and the Empirical Scientist
  • Focus on observation and measurable values
  • Raw material from which scientists draw conclusions and postulate hypotheses
  • Analyzed and interpreted to draw meaningful conclusions
• Data and the Computer Scientist
  • Represents digital information and how that information is stored
    • Integers, floating-point numbers, string literals, Boolean values
    • Databases and complex data structures
  • Emphasis on organizing, transforming, and extracting meaning from large data sets
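As a rough illustration of the computer science view, the sketch below stores one field observation using the basic data types listed above. The variable names and values are hypothetical and chosen only for the example.

```python
# Hypothetical values illustrating the basic data types listed above.
moth_count = 2        # integer
moth_mass_g = 2.2     # floating-point number
habitat = "rocky"     # string literal
is_wet = False        # Boolean value

# A simple data structure (a dictionary) that groups the values into one record.
observation = {
    "count": moth_count,
    "mass_g": moth_mass_g,
    "habitat": habitat,
    "is_wet": is_wet,
}

# Show each value together with the name of the type Python uses to store it.
for name, value in observation.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

The same observation could be stored in other ways (a tuple, a class, a database row); a dictionary is used here only because it keeps each value next to a descriptive name.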

## Empirical Science Data

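These are essentially the results of observations and measurements. They fit into two broad categories, qualitative and quantitative (items 1 and 2 below); the remaining items describe how such data is used, stored, and shared: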

1. Qualitative Data
   • Descriptive features (rocky, sandy, wet, dry, 3-lobed leaves, 4-lobed leaves, 5-lobed leaves, ...)
   • Amenable to Boolean algebra (true or false statements)
2. Quantitative (Numerical) Data
   • Two types
     • Counted: has a number and an entity (two moths)
       • Exact number without uncertainty
       • Can be represented by integers
     • Measured: has a number, a unit, and an entity (2.2 grams of moth)
       • Inexact number with uncertainty
       • Can be represented by floating-point numbers
   • Amenable to arithmetic and Boolean algebra
3. Can be used to describe functional (causal) relationships between two or more variables
   • y = f(x1, x2, x3, ...)
   • y = dependent variable; x1, x2, x3, ... = independent variables
   • Stored in files
     • CSV (comma-separated values), TSV (tab-separated values)
     • XML (eXtensible Markup Language) formats that include metadata
       • SensorML (Sensor Model Language, an Open Geospatial Consortium standard)
       • AnIML (Analytical Information Markup Language)
4. Used to validate scientific theories
5. Historically shared through printed artifacts (Gutenberg-era publications)
6. Digital Data
   1. Legacy data excerpted from the primary literature (printed journals)
   2. Assay data acquired through automated techniques applied to empirical experiments
   3. IoT (Internet of Things) data deposited to online databases in real time through environmental monitoring
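The distinctions in the list above can be sketched in a few lines of Python. The example below is a minimal sketch, not a prescribed workflow: it parses a small, made-up CSV table, keeps the qualitative column as text, converts the counted column to integers and the measured column to floating-point numbers, and then applies a Boolean test and some arithmetic. The column names, site labels, and values are all hypothetical.

```python
import csv
import io

# Hypothetical CSV content; in practice this text would be read from a file.
csv_text = """site,habitat,moth_count,moth_mass_g
A,rocky,2,2.2
B,sandy,5,5.4
C,rocky,3,3.1
"""

rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    rows.append({
        "site": record["site"],
        "habitat": record["habitat"],                  # qualitative: descriptive text
        "moth_count": int(record["moth_count"]),       # counted: exact integer
        "moth_mass_g": float(record["moth_mass_g"]),   # measured: inexact float
    })

# Boolean algebra on qualitative data: which sites are rocky?
rocky_sites = [r["site"] for r in rows if r["habitat"] == "rocky"]

# Arithmetic on quantitative data: total count and mean mass per moth.
total_moths = sum(r["moth_count"] for r in rows)
mean_mass_per_moth = sum(r["moth_mass_g"] for r in rows) / total_moths

print(rocky_sites)    # ['A', 'C']
print(total_moths)    # 10
print(mean_mass_per_moth)
```

Note that the CSV reader returns every field as a string, so the conversion to `int` or `float` is where the counted/measured distinction is made explicit. The unit of the measured value (grams) is carried only in the column name here, which is one reason metadata matters.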

## FAIR Data

Fourth paradigm science involves data-intensive discovery across data sets that are often disparate. For this to work, the data needs to be Findable, Accessible, Interoperable, and Reusable, a requirement that led to the FAIR data principles. The following links go to organizations promoting FAIR data principles. Metadata is data about the data, and proper metadata structures are key to FAIR data.
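To make "data about the data" concrete, the sketch below builds a small metadata record for a hypothetical CSV file and serializes it as JSON. The field names, identifier, file name, and license value are assumptions made up for illustration; they do not follow any particular FAIR metadata standard.

```python
import json

# Hypothetical metadata record; the field names are illustrative only
# and are not taken from a specific metadata schema.
metadata = {
    "identifier": "example-dataset-001",       # placeholder, not a real identifier
    "title": "Moth counts and masses by site",
    "file": "moths.csv",                       # hypothetical data file
    "format": "text/csv",
    "license": "CC-BY-4.0",
    "variables": [
        {"name": "moth_count", "type": "integer", "unit": None},
        {"name": "moth_mass_g", "type": "float", "unit": "gram"},
    ],
}

# Serialize to JSON so the record can be stored and read alongside the data file.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

In this sketch, an identifier and title help make the data findable, a declared format helps with interoperability, and a license and per-variable units help with reuse. Real repositories typically define their own required fields.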