DataFerrett Metadata Interface File (MIF) mifNew XML Reference

Note: If you are not using an (OPTIONAL) element or attribute, DO NOT include it in the MIF. Errors might occur if the element or attribute is blank/empty.

Note: Elements and attributes marked (FUTURE OPTION) will validate in the MIF, but they are not currently used by the DataWeb system.

MIF XML Transactions
Dataset Level Metadata
Variable Level Metadata
new component mif example

Dataset Level Metadata

REQUIRED FIELDS

Element Name

Description

<component>

Dataset Name – limit 255 characters

example:
<component>Basic</component>

<longName>

Data Collection long name


– limit 255 characters

full name of the data collection

example:
<longName>Current Population Survey</longName>

<shortName>

Data Collection short name


– limit 12 characters

short name (acronym)

example:
<shortName>CPS</shortName>

<instance>

Dataset instance description

This is often a year, or a month and year, for the period in which the data were collected or to which they refer.

example:
<instance>Jan 1994</instance>

<category>

Dataset data category with the following type attribute (microdata, aggregate, timeseries, longitudinal, multiDimensionCube, or multiDimensionCubeTimeSeries)

  • aggregate - data that have already been summarized or added up, usually for specific geographical units or some other unit such as industry classifications. In this case, each record is a geographical unit and no summing is needed to get the totals for the geographies.
  • microdata - data in which every record is at the unit-of-analysis level and all records must be added up to get the totals for each data item. For example, for surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization.
  • longitudinal - panel data in which many units are observed over multiple time periods. For example, the Bureau of Labor Statistics National Longitudinal Surveys program collects data from a particular age group of people over many years on an annual or biennial basis. The panel data track the same sample of individuals over many time periods.
  • timeseries - a sequence of observations ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time is called the independent variable.
example:
<category type="aggregate" />

<tabulationHost>

Must contain the following attributes

uri – Tabulation machine address and port (either domain name or IP address), but must be fully qualified and match the ioapi configuration file ("thsm" parameter).

type – Database type id. (refers to ioapi file "thsp" parameter)

example:
<tabulationHost uri="http://my.server.org" type="4505" />

<extractionHost>

Must contain the following attributes

uri – Extraction machine address and port (either domain name or IP address), but must be fully qualified and match the ioapi configuration file

type – Database type id. (refers to ioapi file "thsp" parameter)

example:
<extractionHost uri="http://my.server.org" type="4505" />
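The required dataset-level fields above can be checked mechanically before a MIF is submitted. The following is a minimal sketch in Python, not a DataWeb tool: the function name `check_dataset` and the sample text are hypothetical, the `<dataset>` wrapper is taken from the example at the end of this reference, and only the limits and attribute names stated in the table above are enforced.

```python
import xml.etree.ElementTree as ET

# Required elements, length limits and category types come from the table
# above; the checker itself is a hypothetical helper for illustration.
REQUIRED = ["component", "longName", "shortName", "instance",
            "category", "tabulationHost", "extractionHost"]
MAX_LEN = {"component": 255, "longName": 255, "shortName": 12}
CATEGORY_TYPES = {"microdata", "aggregate", "timeseries", "longitudinal",
                  "multiDimensionCube", "multiDimensionCubeTimeSeries"}

def check_dataset(xml_text):
    """Return a list of problems found in a <dataset> fragment."""
    root = ET.fromstring(xml_text)
    problems = []
    for tag in REQUIRED:
        if root.find(tag) is None:
            problems.append("missing <%s>" % tag)
    for tag, limit in MAX_LEN.items():
        el = root.find(tag)
        if el is not None and len(el.text or "") > limit:
            problems.append("<%s> exceeds %d characters" % (tag, limit))
    cat = root.find("category")
    if cat is not None and cat.get("type") not in CATEGORY_TYPES:
        problems.append("unknown category type")
    for tag in ("tabulationHost", "extractionHost"):
        el = root.find(tag)
        # Both hosts must carry non-empty uri and type attributes.
        if el is not None and not (el.get("uri") and el.get("type")):
            problems.append("<%s> needs uri and type" % tag)
    return problems

SAMPLE = """
<dataset>
  <component>Basic</component>
  <longName>Current Population Survey</longName>
  <shortName>CPS</shortName>
  <instance>Jan 1994</instance>
  <category type="microdata" />
  <tabulationHost uri="http://my.server.org" type="4505" />
  <extractionHost uri="http://my.server.org" type="4505" />
</dataset>
"""
print(check_dataset(SAMPLE))  # an empty list means no problems were found
```

The check does not compare the host values against the ioapi configuration file; that comparison would still have to be done by hand.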





OPTIONAL FIELDS



Element Name

Description

<subsurveyName>

(OPTIONAL) default: N/A

Dataset Intermediate Level Name (sub-collection) – limit 255 characters

example:
<subsurveyName>Intermediate Name</subsurveyName>

<display>

(OPTIONAL) default: normal

Affects the way component name and time are displayed in DataFerrett

Display attribute type
normal – (time at lowest level)
inverted – (dataset at lowest level)

example:
<display type="normal" />

<inheritedComponent>

(OPTIONAL) default: N/A

Inherited Dataset Name (must use an existing dataset’s name within the same Data Collection; e.g. CPS supplements inherit Basic CPS). Since a dataset can inherit multiple components, this element can be repeated within a dataset.

example:
<inheritedComponent>Basic</inheritedComponent>

<embargo>

(OPTIONAL) default: id=0

Places embargo rights on an instance of a dataset. This assumes the group id has already been assigned. An id of 0 represents a public dataset.

example:
<embargo id="0" />

<logo>

(OPTIONAL) default: N/A

Logo image URL for the dataset

An example of the logo

http://pubdb3.census.gov/images/cps_banner.jpg

example:
<logo uri="http://my.server.org/images/logo.jpg" />

<sponsorInfo>

(OPTIONAL) default: N/A

Sponsor Information

Attribute names

  • name - Name of sponsor
  • homepageUrl - Website URL
  • imageUrl - Banner image URL
example:
<sponsorInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />

<providerInfo>

(OPTIONAL) default: N/A

Provider Information

Attribute names

  • name - Name of provider
  • homepageUrl - Website URL
  • imageUrl - Banner image URL
example:
<providerInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />

<abstract>

(OPTIONAL) default: N/A

Dataset Description (Abstract) URL

example:
<abstract originaluri="http://pubdb3.census.gov/abstracts/cps_basic.html" />

<restriction>

(OPTIONAL) default: N/A

Dataset restrictions URL

Statement of dataset access or use restrictions.

example:
<restriction originaluri="http://pubdb3.census.gov/abstracts/cps_basic.html" />

<virtualId>

(OPTIONAL) default: N/A

ID used by the Harvard VDC System. Should not be used with anything else.

example:
<virtualId>id2009:22/test.tab</virtualId>
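The note at the top of this reference warns that blank optional elements may cause errors. One way to guard against that is to strip them before the MIF is written out. The sketch below is an illustration only: the helper names `is_blank` and `drop_blank_optional` are hypothetical, and "blank" is assumed to mean no text, no child elements and no non-empty attribute.

```python
import xml.etree.ElementTree as ET

# Optional dataset-level elements listed in the table above.
OPTIONAL = {"subsurveyName", "display", "inheritedComponent", "embargo",
            "logo", "sponsorInfo", "providerInfo", "abstract",
            "restriction", "virtualId"}

def is_blank(el):
    """True if the element has no text, no children and no non-empty attribute."""
    has_text = bool((el.text or "").strip())
    has_attr = any(value.strip() for value in el.attrib.values())
    return not has_text and not has_attr and len(el) == 0

def drop_blank_optional(dataset):
    """Remove blank optional children in place; return the removed tag names."""
    removed = []
    for child in list(dataset):  # iterate over a copy because we mutate
        if child.tag in OPTIONAL and is_blank(child):
            dataset.remove(child)
            removed.append(child.tag)
    return removed

dataset = ET.fromstring(
    "<dataset>"
    "<component>Basic</component>"
    "<subsurveyName></subsurveyName>"
    '<logo uri="" />'
    '<embargo id="0" />'
    "</dataset>"
)
removed = drop_blank_optional(dataset)
print(removed)                      # tags that were dropped
print([c.tag for c in dataset])     # tags that remain
```

Here `<embargo id="0" />` is kept because its attribute carries a value, while the empty `<subsurveyName>` and the `<logo>` with an empty uri are dropped.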

FUTURE FIELDS

These fields are defined in the schema, but NOT currently used by DataWeb Publisher – for future use only.



Element Name

Description

<collectDate>

NOT USED

Date the data were collected.

example:
<collectDate start="2002" end="2002" />

<refDate>

NOT USED

Reference period of the data.

example:
<refDate start="2002" end="2002" />

<releaseDate>

NOT USED

The date that the work was deposited with the archive.

example:
<releaseDate start="2002" end="2002" />

<keywords>

NOT USED

Words or phrases that describe salient aspects of a data collection's content.

example:
<keywords>
<keyword>job</keyword>
<keyword>occupation</keyword>
</keywords>

<notes>

NOT USED

Any additional information about the dataset.

example:
<notes type="html" uri="http://my.server.org/notes.html" />



Variable Level Metadata

– Child elements are repeated within each var element

REQUIRED FIELDS



Element Name

Description

<variables>

A parent element for all the variables.

Optional attribute
continues – Defines whether a variable continues. Only used when defining differences between variables for the same dataset across time. Allowed values ("Y" or "N"). Default value is "N".

example:
<variables>
  <var name="var1" >
    ....
  </var>
  <var name="var2" >
    ....
  </var>
</variables>

<var>

Item (variable) element. All child elements are attributes of the current item.

Required attribute
id – variable/column name. (limit 25 characters, cannot contain spaces).

Optional attribute
continues – Defines whether a variable continues. Only used when defining differences between variables for the same dataset across time. Allowed values ("Y" or "N"). Default value is "N". This attribute overrides the continues attribute defined in the <variables> element.

example:
<var name="var2" >
.....
</var>

<label>

– limit 60 characters, cannot contain quotation marks.

Short description of item that clearly and concisely identifies its content

example:
<label><![CDATA[Demographics – sex of person]]></label>

<type>

Defines several attributes associated with each variable

attribute "type" use: future option
Item data type. Most people will define type on dataset level.

  • aggregate - data that have already been summarized or added up, usually for specific geographical units or some other unit such as industry classifications. In this case, each record is a geographical unit and no summing is needed to get the totals for the geographies.
  • microdata - data in which every record is at the unit-of-analysis level and all records must be added up to get the totals for each data item. For example, for surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization.
  • longitudinal - panel data in which many units are observed over multiple time periods. For example, the Bureau of Labor Statistics National Longitudinal Surveys program collects data from a particular age group of people over many years on an annual or biennial basis. The panel data track the same sample of individuals over many time periods.
  • timeseries - a sequence of observations ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time is called the independent variable.

attribute "iterationgroup" use: optional; must be an integer value.
Iteration group size for longitudinal data (e.g. if a variable repeats 12 times then 12 would be the group size).

attribute "datatype" use: required

  • numeric - binary integer
  • floatingPoint
  • character - 1 to 255 characters
  • militaryTime - (HH:MM)
  • impliedDecimal - user defines the total length of value including the decimal (length attribute) and the number of digits to the right of the decimal (decimal attribute)
  • iso8602Date
  • other

attribute "decimal" use: required must be an integer value.
If this variable is an integer value then the decimal value will be 0.
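To make the impliedDecimal datatype and the decimal attribute concrete, here is a small sketch of how a stored value might be decoded. It assumes the stored field is a plain digit string with no decimal point and that `decimal` gives the number of digits to the right of the implied point; the function name `decode_implied_decimal` is hypothetical and the actual DataWeb decoding may differ.

```python
from decimal import Decimal

def decode_implied_decimal(raw, places):
    """Interpret a digit string as a number with `places` implied decimal digits."""
    # scaleb shifts the decimal exponent, so no floating-point rounding occurs.
    return Decimal(raw.strip()).scaleb(-places)

print(decode_implied_decimal("12345", 2))  # five stored digits, two after the point
print(decode_implied_decimal("42", 0))     # decimal="0" leaves an integer unchanged
```

Using `Decimal` rather than `float` keeps values such as dollar amounts exact.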

attribute "geographyIndicator" use: optional
Indicates whether the geographic wizard is forced, i.e. whether geographic selection is required. This will only be used if the geocodesetid attribute is defined.

  • Y - yes it is required
  • N - no it is not required

attribute "interval" use: future option

  • continuous
  • discrete
  • ordinal
  • nominal
  • percentage
  • ratio

attribute "isweight" use: required
this item IS a weight

  • Y - yes this item IS a weight
  • N - no this item IS NOT a weight

attribute "logicalType" use: required
This is a place to create special variable classifications, that analysts may want to use.

  • flag
  • edited
  • unedited
  • weighting
  • recode
  • topcoded
  • sampleControl
  • geography
  • replicateWeights
  • other
  • publicUse

attribute "weightvar" use: optional
Names a suggested weight to be used by default in tabulations or identifies a variable as a weight
To define a suggested weight, use the weight’s name
If there is NO suggested weight to be used omit this option
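The attribute rules for `<type>` listed above can be collected into a simple check. This is a sketch under stated assumptions, not the schema itself: the function `check_type` is hypothetical, the allowed values are copied from the lists above exactly as printed, and `decimal` is only checked when present because this reference marks it as required in one place and optional in another.

```python
import xml.etree.ElementTree as ET

# Allowed values copied from the lists above, spelled as printed there.
DATATYPES = {"numeric", "floatingPoint", "character", "militaryTime",
             "impliedDecimal", "iso8602Date", "other"}
LOGICAL_TYPES = {"flag", "edited", "unedited", "weighting", "recode",
                 "topcoded", "sampleControl", "geography",
                 "replicateWeights", "other", "publicUse"}
YES_NO = {"Y", "N"}

def _is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False

def check_type(type_xml):
    """Return a list of problems found in a single <type ... /> element."""
    attrs = ET.fromstring(type_xml).attrib
    problems = []
    if attrs.get("datatype") not in DATATYPES:
        problems.append("datatype missing or not an allowed value")
    if attrs.get("isweight") not in YES_NO:
        problems.append("isweight must be Y or N")
    if attrs.get("logicalType") not in LOGICAL_TYPES:
        problems.append("logicalType missing or not an allowed value")
    for name in ("decimal", "iterationgroup"):
        if name in attrs and not _is_int(attrs[name]):
            problems.append(name + " must be an integer")
    if attrs.get("geographyIndicator", "N") not in YES_NO:
        problems.append("geographyIndicator must be Y or N")
    return problems

GOOD = '<type datatype="numeric" decimal="0" isweight="N" logicalType="edited" />'
print(check_type(GOOD))  # an empty list means no problems were found
```

The `<type>` element in `GOOD` is an illustrative combination of documented attributes, not an example taken from a real MIF.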

<security>

(OPTIONAL) default: public

Security Level (e.g. public or private) for a specific item. In most cases this field is not used.

    attribute "level"
  • public - public item
  • sponsor - private or embargo item
  • other - unknown

example:
<security level="public" />

OPTIONAL FIELDS



<concept>

(OPTIONAL) default: N/A

– limit 255 characters

Concept or topic label that the variable is grouped into. Even though this field is optional, it is strongly recommended to organize your variables into different topics.

example:
<concept><![CDATA[Demographic Variables]]></concept>

<unit>

(OPTIONAL) default: N/A

Unit type (e.g. dollars, minutes, percent, etc.)

    attribute "type"
  • absolute - absolute number
  • average - median or mean
  • dollars - dollar amount
  • minutes - time in minutes
  • percent
  • ratio
  • inches
  • degrees
  • squareMiles
  • thousandsDollars
  • incidentPoint
  • other

example:
<unit type="absolute" />

<universe>

(OPTIONAL) default: N/A

Universe description (must follow the long description). This is appended at the end of the long description

example:
<universe type="all" >Every Person</universe>

FUTURE FIELDS



<embargo>

NOT USED

Used to identify when a dataset should be released to the public.

Optional attributes
before - this item will be embargoed if the date is before.
after - this item will be embargoed if the date is after.

example:
<embargo before="Jan 2006" />



OPTIONAL, but strongly recommended for microdata items that are not weights or allocation flags:

Element Name

Description

V

Values with descriptions (define all possible values). Label has a limit of 100 characters.
(e.g. 1=Male and 2=Female, or 0 to 99 years)

example: V 1 Male

example: V 2 Female

or, for an item with a numeric range of valid values, define the minimum and maximum –

example: V 0:99 Years

or, IF the data contains a blank, the value should be defined exactly as -
example:
V Blank Definition of blank value

:L:



:L:

Long description of item. This could include the full question text, interviewer instructions or recode/topcode definitions. These delimiters are on their own line, one above the description, and one below.

example: :L:

Enter Appropriate Sex.

Ask Only If Necessary: What is your sex?

:L:





OPTIONAL FIELDS

Element Name

Description

:A:

Attachment URL (e.g. Edit Specs, Recode Specs, Instrument Specs, etc.)

example: :A: Edit Specifications

http://www.bls.census.gov/specs/pesex.htm

B

Synonyms (Multiple words should either be listed separately, or comma delimited).
example: B men
B boy
B gender
B women
B girl

or
example: B men, boy, gender, women, girl



**********************************************************************************

______________________________________

Detailed Dataset Information

______________________________________

<component>Dataset Name</component>

limit 255 characters

name generally used when referring to the dataset

______________________________________

<longName>Data Collection name</longName>

– limit 255 characters {defaults to Dataset Name}

A collection is a dataset that is collected on a continuing basis, for example when there is a new dataset every month or every year. The datasets in a collection typically share many of the same questions. Once a dataset has been documented for the first time, only the variables that have changed, been added, or been discontinued since previous versions need to be documented.

______________________________________

<shortName>SN</shortName>

– limit 12 characters

Shortened or abbreviated name or acronym for the data collection

This is the abbreviated name of the dataset. It often appears as a mouse-over on menus.

______________________________________

<instance>2006</instance>

– the time period the dataset refers to in its questions or administrative processes.

______________________________________

<category type="aggregate" />

Dataset data category (microdata, aggregate, time series, longitudinal)

– Microdata are individual record data that have not been aggregated into counts, averages, medians, rates etc. The system is designed to aggregate these records very efficiently and flexibly.

– Aggregate data are data that are already summarized into counts, weighted counts etc. Such datasets may be very large, and the system is designed to retrieve such data very quickly and to let the analyst arrange and manipulate them in a spreadsheet very easily. Aggregated data may be organized by certain variables such as geography or industry and may be meaningless unless those “required” variables are part of the selection or spreadsheet that is displayed. The system keeps track of such variables and prompts the user to select them.

– Time series data are aggregated data kept as a trend over time. They are typically counts, averages, rates etc. and may also be transformed into indexes, adjusted for inflation, seasonally adjusted, or expressed as a moving average or a growth rate.

– Longitudinal data are microdata (data kept on individual persons, companies etc.) that are kept as individual data records over time, so that a researcher can statistically track life processes.

______________________________________

<display type="normal" />

(OPTIONAL) default: normal

– Affects the way component name and time are displayed in DataFerrett

Display type (the way the dataset is displayed in the list of datasets in the DataFerrett browser)

This is “display metadata”: it does not describe the data, but describes how the dataset will be displayed in the DataFerrett tool. Specifically, it determines which dataset names will be major folders and which will be sub-folders. Most simple datasets do not need it; it is only used by complex datasets with many sub-datasets that need to be combined to be used effectively by analysts.

______________________________________

<tabulationHost uri="http://my.server.org" type="4505" />

Must contain the following attributes

uri – Tabulation machine address and port (either domain name or IP address), but must be fully qualified and match the ioapi configuration file ("thsm" parameter). Tabulation and extraction are often the same machine with the same type.

type – Database type id. (refers to ioapi file "thsp" parameter)

______________________________________

<extractionHost uri="http://my.server.org" type="4505" />

Must contain the following attributes

uri – Extraction machine address and port (either domain name or IP address), but must be fully qualified and match the ioapi configuration file ("thsm" parameter). Tabulation and extraction are often the same machine with the same type.

type – Database type id. (refers to ioapi file "thsp" parameter)

______________________________________

<subsurveyName>Dataset Intermediate Level Name</subsurveyName>

(OPTIONAL) default: N/A

Dataset Intermediate Level Name (sub-collection) – limit 255 characters

Rarely used; this is only needed by very complex datasets that are made up of a collection of sub-datasets. Contact the DataWeb team for more information.

______________________________________

<inheritedComponent>Inherited Dataset Name</inheritedComponent>

(OPTIONAL) default: N/A

Inherited Dataset Name (must use an existing dataset’s name within the same Data Collection; e.g. CPS supplements inherit Basic CPS). Since a dataset can inherit multiple components, this element can be repeated within a dataset.

Rarely used; this is only needed by very complex datasets that are made up of a collection of sub-datasets. Contact the DataWeb team for more information.

______________________________________

<embargo id="0" />

(OPTIONAL) default: id=0

Places embargo rights on an instance of a dataset. This assumes the group id has already been assigned. An id of 0 represents a public dataset.

______________________________________

<logo uri="http://my.server.org/images/logo.jpg" />

(OPTIONAL) default: N/A

Logo/Banner image URL for the dataset

______________________________________

<sponsorInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />

(OPTIONAL) default: N/A

Sponsor Information

Attribute names: name (name of sponsor), homepageUrl (website URL), imageUrl (banner image URL)

______________________________________

<providerInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />

(OPTIONAL) default: N/A

Provider Information

Attribute names: name (name of provider), homepageUrl (website URL), imageUrl (banner image URL)

______________________________________

<abstract originaluri="http://my.server.org/abstracts/abstract.html" />

(OPTIONAL) default: N/A

Dataset Description (Abstract) URL

______________________________________

<restriction originaluri="http://my.server.org/abstracts/abstract.html" />

(OPTIONAL) default: N/A

Dataset Restriction URL

Statement of dataset access or use restrictions.

______________________________________

<virtualId>id2009:22/test.tab</virtualId>

(OPTIONAL) default: N/A

ID used by the Harvard VDC System. Should not be used with anything else.

______________________________________

<collectDate start="2002" end="2002" />

NOT USED

Date the data were collected.

______________________________________

<refDate start="2002" end="2002" />

NOT USED

Reference period of the data.

______________________________________

<releaseDate start="2002" end="2002" />

NOT USED

The date that the work was deposited with the archive.

______________________________________

<keywords>

NOT USED

Words or phrases that describe salient aspects of a data collection's content.

______________________________________

<notes type="html" uri="http://my.server.org/notes.html" />

NOT USED

Any additional information about the dataset.

______________________________________


**********************************************************************************

______________________________________

Detailed Variable Level Metadata

______________________________________

<variables>

A parent element for all the variables.

Optional attribute
continues – Defines whether a variable continues. Only used when defining differences between variables for the same dataset across time. Allowed values ("Y" or "N"). Default value is "N".

example 1:
<variables>
  <var name="var1" >
    ....
  </var>
  <var name="var2" >
    ....
  </var>
</variables>

example 2:
note: (This example shows that all variables defined will continue to the next instance of the dataset. A local <var> continues attribute will override the "global" <variables> attribute. For most datasets this attribute is not needed.)
<variables continues="Y" >
  <var name="var1" >
    ....
  </var>
  <var name="var2" >
    ....
  </var>
</variables>

______________________________________

<var id="PERACE" >

Item (variable) element. All child elements are attributes of the current item.

Required attribute
id – variable/column name. (limit 25 characters, cannot contain spaces).

Optional attribute
continues – Defines whether a variable continues. Only used when defining differences between variables for the same dataset across time. Allowed values ("Y" or "N"). Default value is "N". This attribute overrides the continues attribute defined in the <variables> element.

example:
note: (This example shows that var1 will continue to the next instance of the dataset. For most datasets this attribute is not needed.)
<variables>
  <var name="var1" continues="Y" >
    ....
  </var>
  <var name="var2">
    ....
  </var>
</variables>

example 2:
note: (This example shows that all variables defined will continue to the next instance of the dataset with the exception of var2. A local <var> continues attribute will override the "global" <variables> attribute. For most datasets this attribute is not needed.)
<variables continues="Y" >
  <var name="var1" >
    ....
  </var>
  <var name="var2" continues="N" >
    ....
  </var>
  <var name="var3" >
    ....
  </var>
</variables>
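The override rule illustrated by the examples above can be expressed in a few lines. This is a minimal sketch, assuming only what the text states: the default is "N", a continues attribute on `<variables>` sets the default for all children, and a continues attribute on a `<var>` wins. The function name `effective_continues` is hypothetical; because this reference names the required attribute `id` while its examples use `name`, the sketch accepts either.

```python
import xml.etree.ElementTree as ET

def effective_continues(variables_xml):
    """Map each variable to its effective continues value ("Y" or "N")."""
    root = ET.fromstring(variables_xml)
    default = root.get("continues", "N")  # documented default is "N"
    result = {}
    for var in root.findall("var"):
        # Assumption: accept "id" (as documented) or "name" (as in the examples).
        key = var.get("id") or var.get("name")
        result[key] = var.get("continues", default)  # local value overrides
    return result

EXAMPLE_2 = """
<variables continues="Y">
  <var name="var1"></var>
  <var name="var2" continues="N"></var>
  <var name="var3"></var>
</variables>
"""
print(effective_continues(EXAMPLE_2))
# {'var1': 'Y', 'var2': 'N', 'var3': 'Y'}
```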

______________________________________

<label>

– limit 60 characters, cannot contain quotation marks.

Short description of item that clearly and concisely identifies its content

example:
<label><![CDATA[Demographics – sex of person]]></label>

______________________________________

<concept>

(OPTIONAL) default: N/A

– limit 255 characters

Concept or topic label that the variable is grouped into. Even though this field is optional, it is strongly recommended to organize your variables into different topics.

example:
<concept><![CDATA[Demographic Variables]]></concept>

______________________________________

<embargo before="Jan 2006" />

NOT USED

Used to identify when a dataset should be released to the public.

Optional attributes
before - this item will be embargoed if the date is before.
after - this item will be embargoed if the date is after.

______________________________________

<security level="public" />

(OPTIONAL) default: public

Security Level (e.g. public or private) for a specific item. In most cases this field is not used.

______________________________________

<unit type="absolute" />

(OPTIONAL) default: N/A

Unit type (e.g. dollars, minutes, percent, etc.)

______________________________________

<type>

Defines several attributes associated with each variable

attribute "type" use: future option
Item data type. Most people will define type on dataset level.

attribute "iterationgroup" use: optional
Iteration group size for longitudinal data (e.g. if a variable repeats 12 times then 12 would be the group size). If used, this attribute must contain an integer value.
This is only used for longitudinal data (see the longitudinal category described in the dataset information above). It describes data that repeat for an individual over time. For example, the race of a person does not change, so there are no iterations associated with a race variable. Variables like income will change over time, so the income variable may have an iteration associated with it (i.e. Income1, Income2, Income3, ... IncomeN).

attribute "datatype" use: required

attribute "decimal" use: optional must be an integer value.
If this variable is an integer value then the decimal value will be 0.

attribute "geographyIndicator" use: optional
Indicates whether the geographic wizard is forced, i.e. whether geographic selection is required. This will only be used if the geocodesetid attribute is defined.

attribute "geocodesetid" use: optional; must be an integer value.
If this is a standard geographic variable then no values need to be added; only the correct geographic codeset id needs to be identified. This is only a temporary attribute and needs to be modified.
This tells the system that the variable described is a geocode and can be associated with a specific geography (a map polygon, line, or point). Typically these geocodes are standard geographies, which may be used to match data from one dataset to another as well as for mapping.

attribute "interval" use: future option

attribute "isweight" use: required
this item IS a weight

attribute "logicalType" use: required
This is a place to create special variable classifications, that analysts may want to use.

attribute "weightvar" use: optional
Names a suggested weight to be used by default in tabulations or identifies a variable as a weight
To define a suggested weight, use the weight’s name
If there is NO suggested weight to be used omit this option

______________________________________

<universe type="all">

(OPTIONAL) default: N/A

Universe description (must follow the long description).

This will be displayed along with the “Long Description” described above. The universe describes the type of people that may answer this question. It typically is determined by the answers the respondent gave to earlier questions. For example, the question may ask “how long have you been unemployed”. Only people who have previously said that they are unemployed, will be asked this question.

______________________________________

V: Values with descriptions (define all possible values). Label has a limit of 100 characters.
(e.g. 1=Male and 2=Female, or 0 to 99 years)

These are the labels for values of microdata questions. Microdata are very difficult to use without these labels. The system uses them to label any tabulations, maps or business graphics that are produced. Typically, a microdata variable can be traced to a question on a survey or poll questionnaire or, if the data come from administrative sources, to an element of a government form or business process.

example: V 1 Male

example: V 2 Female

or, for an item with a numeric range of valid values, define the minimum and maximum

example: V 0:99 Years

or, IF the data contains a blank, the value should be defined exactly as -
example:
V Blank Definition of blank value
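The three forms of V line shown above (single value, min:max range, and Blank) can be told apart with a short parser. The sketch below is illustrative only: `parse_value_line` is a hypothetical name, and it assumes the marker, value and label are separated by whitespace and that the value itself contains no spaces.

```python
def parse_value_line(line):
    """Parse one 'V <value> <label>' line into a small dict."""
    marker, value, label = line.split(None, 2)  # label may contain spaces
    if marker != "V":
        raise ValueError("not a V line: %r" % line)
    if len(label) > 100:
        raise ValueError("label exceeds 100 characters")
    if value == "Blank":
        return {"kind": "blank", "label": label}
    if ":" in value:
        low, high = value.split(":", 1)
        return {"kind": "range", "min": int(low), "max": int(high), "label": label}
    return {"kind": "single", "value": value, "label": label}

for text in ("V 1 Male", "V 0:99 Years", "V Blank Definition of blank value"):
    print(parse_value_line(text))
```

Single values are kept as strings so that codes with leading zeros are not altered.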

______________________________________

:L: Long description of item.

This could include the full question text in a survey or administrative form, interviewer instructions for a survey, or recode/topcode definitions. These delimiters are on their own line, one above the description and one below.

example: :L:

Enter Appropriate Sex.

Ask Only If Necessary: What is your sex?

:L:
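Because the :L: delimiters sit on their own lines, the long description can be recovered by collecting everything between the first pair. A minimal sketch, with the hypothetical helper name `extract_long_description`:

```python
def extract_long_description(lines):
    """Return the text between the first pair of ':L:' delimiter lines."""
    inside = False
    body = []
    for line in lines:
        if line.strip() == ":L:":
            if inside:  # closing delimiter reached
                return "\n".join(body).strip()
            inside = True  # opening delimiter
        elif inside:
            body.append(line.rstrip())
    raise ValueError("missing or unterminated :L: block")

LINES = [
    ":L:",
    "Enter Appropriate Sex.",
    "",
    "Ask Only If Necessary: What is your sex?",
    ":L:",
]
print(extract_long_description(LINES))
```

A block that is opened but never closed raises an error rather than silently swallowing the rest of the file.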

______________________________________

:A: Attachment URL (e.g. Edit Specs, Recode Specs, Instrument Specs, etc.)

example: :A: Edit Specifications

http://www.bls.census.gov/specs/pesex.htm

______________________________________

B: Synonyms (Multiple words should either be listed separately, or comma delimited).

This allows metadata creators to put commonly used synonyms into the search facility.


example: B men
B boy
B gender
B women
B girl
or
example: B men, boy, gender, women, girl
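Both synonym layouts shown above, one word per B line or a single comma-delimited B line, carry the same information. The following sketch normalizes either layout into one list; `parse_synonyms` is a hypothetical helper, not part of DataWeb.

```python
def parse_synonyms(lines):
    """Collect synonyms from 'B ...' lines, one per line or comma-delimited."""
    synonyms = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "B":
            synonyms.extend(w.strip() for w in parts[1].split(",") if w.strip())
    return synonyms

separate = ["B men", "B boy", "B gender", "B women", "B girl"]
combined = ["B men, boy, gender, women, girl"]
print(parse_synonyms(separate))
print(parse_synonyms(combined))
```

Both calls produce the same list, so downstream search indexing would not need to care which layout the metadata author chose.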

______________________________________


**********************************************************************************

mifNew example

Energy Consumption by Zipcode 2004 Metropolitan Council of Governments MCOG

<dataset>
   <embargo>
   <component>
   <instance>
   <longName>
   <shortName>
   <subsurveyName>
   <inheritedComponent>
   <collectDate>
   <refDate>
   <releaseDate>
   <virtualId>
   <category>
   <display>
   <extractionHost>
   <tabulationHost>
   <keywords>
   <sponsorInfo>
   <providerInfo>
   <logo>
   <restriction>
   <abstract>
   <notes>
</dataset>


