DataFerrett Metadata Interface File (MIF) mifNew XML Reference
Note: If you are not using an (OPTIONAL) element or attribute, DO NOT include it in the MIF. Errors might occur if the element or attribute is blank/empty.
Note: A (FUTURE OPTION) element or attribute may be included and the MIF will still validate, but it is not currently used in the DataWeb system.
MIF XML Transactions
Dataset Level Metadata
Variable Level Metadata
new component mif example
REQUIRED FIELDS
| Element Name | Description |
| <component> | Dataset name – limit 255 characters. example: <component>Dataset Name</component> |
| <longName> | Data Collection long name – limit 255 characters. Full name of the data collection. example: <longName>Current Population Survey</longName> |
| <shortName> | Data Collection short name (acronym) – limit 12 characters. example: <shortName>CPS</shortName> |
| <instance> | Dataset instance description. This is often a year, or a month and year, for the period in which the data were collected or to which they refer. example: <instance>Jan 1994</instance> |
| <category> | Dataset data category, given by the type attribute (microdata, aggregate, timeseries, longitudinal, multiDimensionCube, or multiDimensionCubeTimeSeries). example: <category type="aggregate" /> |
| <tabulationHost> | Must contain the following attributes. uri – Tabulation machine address and port (either domain name or IP address); it must be fully qualified and match the ioapi configuration file ("thsm" parameter). type – Database type id (refers to the ioapi file "thsp" parameter). example: <tabulationHost uri="http://my.server.org" type="4505" /> |
| <extractionHost> | Must contain the following attributes. uri – Extraction machine address and port (either domain name or IP address); it must be fully qualified and match the ioapi configuration file. type – Database type id (refers to the ioapi file "thsp" parameter). example: <extractionHost uri="http://my.server.org" type="4505" /> |
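The required-field rules above can be checked mechanically before a MIF is submitted. The following is a minimal sketch, not part of DataWeb: the element names, length limits, and category types are taken from this reference, while the `<dataset>` wrapper element and the `check_dataset` helper are hypothetical and exist only so the fragment can be parsed.

```python
import xml.etree.ElementTree as ET

# Names and limits come from the required-fields table; <dataset> is a
# hypothetical wrapper, not an element defined by this reference.
REQUIRED = ["component", "longName", "shortName", "instance",
            "category", "tabulationHost", "extractionHost"]
MAX_LEN = {"component": 255, "longName": 255, "shortName": 12}
CATEGORY_TYPES = {"microdata", "aggregate", "timeseries", "longitudinal",
                  "multiDimensionCube", "multiDimensionCubeTimeSeries"}


def check_dataset(xml_text):
    """Return a list of problems found in a dataset-level fragment."""
    root = ET.fromstring(xml_text)
    problems = []
    for name in REQUIRED:
        if root.find(name) is None:
            problems.append(f"missing <{name}>")
    for name, limit in MAX_LEN.items():
        elem = root.find(name)
        if elem is not None and len(elem.text or "") > limit:
            problems.append(f"<{name}> exceeds {limit} characters")
    category = root.find("category")
    if category is not None and category.get("type") not in CATEGORY_TYPES:
        problems.append("unknown category type")
    return problems


SAMPLE = """<dataset>
  <component>Basic</component>
  <longName>Current Population Survey</longName>
  <shortName>CPS</shortName>
  <instance>Jan 1994</instance>
  <category type="microdata" />
  <tabulationHost uri="http://my.server.org" type="4505" />
  <extractionHost uri="http://my.server.org" type="4505" />
</dataset>"""

print(check_dataset(SAMPLE))
```

An empty list means none of the checked rules were violated; it does not guarantee that the MIF validates against the schema.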
OPTIONAL FIELDS
| Element Name | Description |
| <subsurveyName> | (OPTIONAL) default: N/A. Dataset Intermediate Level Name (sub-collection) – limit 255 characters. example: <subsurveyName>Intermediate Name</subsurveyName> |
| <display> | (OPTIONAL) default: normal. Affects the way the component name and time are displayed in DataFerrett. Display attribute: type. example: <display type="normal" > |
| <inheritedComponent> | (OPTIONAL) default: N/A. Inherited Dataset Name (must use an existing dataset’s name within the same Data Collection; e.g. CPS supplements inherit Basic CPS). Since a dataset can inherit multiple components, this element can be repeated within a dataset. example: <inheritedComponent>Basic</inheritedComponent> |
| <embargo> | (OPTIONAL) default: id=0. Used for placing embargo rights on an instance of a dataset. This assumes the group id has already been assigned. An id of 0 represents a public dataset. example: <embargo id="0" /> |
| <logo> | (OPTIONAL) default: N/A. Logo image URL for the dataset. An example of a logo: http://pubdb3.census.gov/images/cps_banner.jpg example: <logo uri="http://my.server.org/images/logo.jpg" /> |
| <sponsorInfo> | (OPTIONAL) default: N/A. Sponsor information. Attribute names: name, homepageUrl, imageUrl. example: <sponsorInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" /> |
| <providerInfo> | (OPTIONAL) default: N/A. Provider information. Attribute names: name, homepageUrl, imageUrl. example: <providerInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" /> |
| <abstract> | (OPTIONAL) default: N/A. Dataset Description (Abstract) URL. example: <abstract originaluri="http://pubdb3.census.gov/abstracts/cps_basic.html" /> |
| <restriction> | (OPTIONAL) default: N/A. Dataset restrictions URL. Statement of dataset access or use restrictions. example: <restriction originaluri="http://pubdb3.census.gov/abstracts/cps_basic.html" /> |
| <virtualId> | (OPTIONAL) default: N/A. ID used by the Harvard VDC System. Should not be used with anything else. example: <virtualId>id2009:22/test.tab</virtualId> |
FUTURE FIELDS
These fields are defined in the schema, but NOT currently used by DataWeb Publisher – for future use only.
| Element Name | Description |
| <collectDate> | NOT USED. Date the data were collected. example: <collectDate start="2002" end="2002" /> |
| <refDate> | NOT USED. Reference period of the data. example: <refDate start="2002" end="2002" /> |
| <releaseDate> | NOT USED. The date that the work was deposited with the archive. example: <releaseDate start="2002" end="2002" /> |
| <keywords> | NOT USED. Words or phrases that describe salient aspects of a data collection's content. example: <keywords> <keyword>job</keyword> <keyword>occupation</keyword> </keywords> |
| <notes> | NOT USED. Any additional information about the dataset. example: <notes type="html" uri="http://my.server.org/notes.html" /> |
Variable Level Metadata – Child elements are repeated within each var element
REQUIRED FIELDS
| Element Name | Description |
| <variables> | A parent element for all the variables. Optional attribute: continues. example: <variables> <var name="var1" > .... </var> <var name="var2" > .... </var> </variables> |
| <var> | Item (variable) element. All child elements are attributes of the current item. Required attribute: the variable/column name. Optional attribute: continues. example: <var name="var2" > ..... </var> |
| <label> | Limit 60 characters; cannot contain quotation marks. Short description of the item that clearly and concisely identifies its content. example: <label><![CDATA[Demographics – sex of person]]></label> |
|  | Defines several attributes associated with each variable. attribute "type" use: future option; attribute "iterationgroup" use: optional, must be an integer value; attribute "datatype" use: required; attribute "decimal" use: required, must be an integer value; attribute "geographyIndicator" use: optional; attribute "interval" use: future option; attribute "isweight" use: required; attribute "logicalType" use: required, a place to create special variable classifications that analysts may want to use; attribute "weightvar" use: optional, names a suggested weight to be used by default in tabulations or identifies a variable as a weight. To define a suggested weight, use the weight’s name. If there is NO suggested weight, omit this attribute. |
|  | (OPTIONAL) default: public. Security Level (e.g. public or private) for a specific item. In most cases this field is not used. |
OPTIONAL FIELDS
| Element Name | Description |
| <concept> | (OPTIONAL) default: N/A – limit 255 characters. Concept or topic label that the variable is grouped into. Even though this field is optional, it is strongly recommended to organize your variables into different topics. example: <concept><![CDATA[Demographic Variables]]></concept> |
|  | (OPTIONAL) default: N/A. Unit type (e.g. dollars, minutes, percent, etc.) |
|  | (OPTIONAL) default: N/A. Universe description (must follow the long description). This is appended to the end of the long description. |
FUTURE FIELDS
| Element Name | Description |
| <security> | NOT USED. Used to identify when a dataset should be released to the public. Optional attributes: before, after. example: <security before="Jan 2006" /> |
OPTIONAL, but strongly recommended for microdata items that are not weights or allocation flags:
| Element Name | Description |
| V | Values with descriptions (define all possible values). Label has a limit of 100 characters. example: V 1 Male example: V 2 Female; or, for an item with a numeric range of valid values, define the minimum and maximum – example: V 0:99 Years; or, if the data contain a blank, the value should be defined exactly as Blank |
| :L: | Long description of item. This could include the full question text, interviewer instructions, or recode/topcode definitions. The :L: delimiters are on their own lines, one above the description and one below. example: :L: Enter Appropriate Sex. Ask Only If Necessary: What is your sex? :L: |
OPTIONAL FIELDS
| Element Name | Description |
| :A: | Attachment URL (e.g. Edit Specs, Recode Specs, Instrument Specs, etc.). example: :A: Edit Specifications http://www.bls.census.gov/specs/pesex.htm |
| B | Synonyms (multiple words should either be listed separately or be comma delimited). |
**********************************************************************************
______________________________________
Detailed Dataset Information
______________________________________
<component>Dataset Name</component>
– limit 255 characters
name generally used when referring to the dataset
______________________________________
<longName>Data Collection name</longName>
– limit 255 characters {defaults to Dataset Name}
A Collection is a group of datasets collected on a continuing basis, for example when there is a new dataset every month or every year. The datasets in a collection typically share many of the same questions. Once a dataset has been documented for the first time, only the variables that have changed from previous versions, or that have been added or discontinued, need to be documented.
______________________________________
<shortName>CPS</shortName>
– limit 12 characters
Shortened or abbreviated name, or acronym, for the data collection. It often appears in mouse-overs on menus.
______________________________________
<instance>Jan 1994</instance>
– The time period the dataset refers to in its questions or administrative processes.
______________________________________
<category type="aggregate" />
Dataset data category (microdata, aggregate, time series, longitudinal)
– Microdata is individual record data that has not been aggregated into counts, averages, medians, rates, etc. The system is designed to aggregate these records very efficiently and flexibly.
– Aggregate data are data that are already summarized into counts, weighted counts, etc. Such datasets may be very large, and the system is designed to retrieve such data very quickly and to let the analyst arrange and manipulate them easily in a spreadsheet. Aggregated data may be organized by certain variables, such as geography or industry, and may be meaningless unless those “required” variables are part of the selection or spreadsheet that is displayed. The system keeps track of such variables and prompts the user to select them.
– Time series data is aggregated data that is kept as a trend over time. It is typically counts, averages, rates, etc., and may also be transformed into indexes, adjusted for inflation, seasonally adjusted, or expressed as a moving average or a growth rate.
– Longitudinal data is microdata (data kept on individual persons, companies, etc.) that is kept as individual data records over time, so that a researcher can statistically track life processes.
______________________________________
<display type="normal" >
(OPTIONAL) default: normal
– Affects the way the component name and time are displayed in DataFerrett, that is, the way the dataset is displayed in the list of datasets in the DataFerrett browser.
This is “display metadata”: it does not describe the data, but describes how the dataset will be displayed in the DataFerrett tool. Specifically, it describes which dataset names will be major folders and which will be subfolders. Most simple datasets do not use it; it is needed only by complex datasets with many sub-datasets that must be combined to be used effectively by analysts.
______________________________________
<tabulationHost uri="http://my.server.org" type="4505" />
Must contain the following attributes
uri – Tabulation machine address and port (either domain name or IP address); it must be fully qualified and match the ioapi configuration file. Tabulation and extraction are often the same machine with the same type.
type – Database type id. (refers to ioapi file "thsp" parameter)
______________________________________
<extractionHost uri="http://my.server.org" type="4505" />
Must contain the following attributes
uri – Extraction machine address and port (either domain name or IP address); it must be fully qualified and match the ioapi configuration file ("thsm" parameter). Tabulation and extraction are often the same machine with the same type.
type – Database type id. (refers to ioapi file "thsp" parameter)
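A quick sanity check on a host element might look like the sketch below. It assumes that "fully qualified" means the uri carries both a scheme and a host name; that reading, and the `host_problems` helper itself, are illustrative assumptions rather than rules stated by this reference. The check does not compare the values against the ioapi configuration file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def host_problems(xml_text):
    """Check a <tabulationHost> or <extractionHost> element (sketch only)."""
    elem = ET.fromstring(xml_text)
    problems = []
    parsed = urlparse(elem.get("uri", ""))
    # Assumption: a fully qualified uri has a scheme and a host name.
    if not parsed.scheme or not parsed.hostname:
        problems.append("uri is not fully qualified")
    if not elem.get("type"):
        problems.append("missing type attribute")
    return problems


print(host_problems('<tabulationHost uri="http://my.server.org" type="4505" />'))
```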
______________________________________
<subsurveyName>Dataset Intermediate Level Name</subsurveyName>
(OPTIONAL) default: N/A
Dataset Intermediate Level Name (sub-collection) – limit 255 characters
Rarely used. It is needed only by very complex datasets that are made up of a collection of sub-datasets. Contact the DataWeb team for more information.
______________________________________
<inheritedComponent>Inherited Dataset Name</inheritedComponent>
(OPTIONAL) default: N/A
Inherited Dataset Name (must use an existing dataset’s name within the same Data Collection; e.g. CPS supplements inherit Basic CPS). Since a dataset can inherit multiple components, this element can be repeated within a dataset.
Rarely used. It is needed only by very complex datasets that are made up of a collection of sub-datasets. Contact the DataWeb team for more information.
______________________________________
<embargo id="0" />
(OPTIONAL) default: id=0
Used for placing embargo rights on an instance of a dataset. This assumes the group id has already been assigned. An id of 0 represents a public dataset.
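Because the default is id=0 and an id of 0 means public, a dataset with no embargo element can be treated the same as one with `<embargo id="0" />`. The sketch below illustrates that reading; the `<dataset>` wrapper and the `is_public` helper are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET


def is_public(dataset_xml):
    """True when the embargo element is absent or carries id 0."""
    root = ET.fromstring(dataset_xml)  # <dataset> is a hypothetical wrapper
    embargo = root.find("embargo")
    return embargo is None or embargo.get("id", "0") == "0"


print(is_public('<dataset><embargo id="0" /></dataset>'))
```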
______________________________________
<logo uri="http://my.server.org/images/logo.jpg" />
(OPTIONAL) default: N/A
Logo/Banner image URL for the dataset
______________________________________
<sponsorInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />
(OPTIONAL) default: N/A
Sponsor Information
attribute names: name, homepageUrl, imageUrl
______________________________________
<providerInfo name="org name" homepageUrl="http://my.server.org/index.html" imageUrl="http://my.server.org/images/image.jpg" />
(OPTIONAL) default: N/A
Provider Information
attribute names: name, homepageUrl, imageUrl
______________________________________
<abstract originaluri="http://my.server.org/abstracts/abstract.html" />
(OPTIONAL) default: N/A
Dataset Description (Abstract) URL
______________________________________
<restriction originaluri="http://my.server.org/abstracts/abstract.html" />
(OPTIONAL) default: N/A
Dataset Restriction URL
Statement of dataset access or use restrictions.
______________________________________
<virtualId>id2009:22/test.tab</virtualId>
(OPTIONAL) default: N/A
ID used by the Harvard VDC System. Should not be used with anything else.
______________________________________
<collectDate start="2002" end="2002" />
NOT USED
Date the data were collected.
______________________________________
<refDate start="2002" end="2002" />
NOT USED
Reference period of the data.
______________________________________
<releaseDate start="2002" end="2002" />
NOT USED
The date that the work was deposited with the archive.
______________________________________
<keywords> <keyword>job</keyword> <keyword>occupation</keyword> </keywords>
NOT USED
Words or phrases that describe salient aspects of a data collection's content.
______________________________________
<notes type="html" uri="http://my.server.org/notes.html" />
NOT USED
Any additional information about the dataset.
______________________________________
**********************************************************************************
______________________________________
Detailed Variable Level Metadata
______________________________________
A parent element for all the variables.
Optional attribute
continues – Defines whether a variable continues. Only used when defining differences between variables for the same dataset across time. Allowed values are "Y" and "N"; the default is "N".
example:
note: (This example shows that all variables defined will continue to the next instance of the dataset. A local <var> continues attribute will override the "global" <variables> attribute. For most datasets this attribute is not needed.)
<variables continues="Y" >
<var name="var1" >
....
</var>
<var name="var2" >
....
</var>
</variables>
______________________________________
Item (variable) element. All child elements are attributes of the current item.
Required attribute
name – variable/column name (limit 25 characters, cannot contain spaces).
Optional attribute
continues – Defines whether a variable continues. Only used when defining differences for a variable in the same dataset across time. Allowed values are "Y" and "N"; the default is "N". This attribute overrides the continues attribute defined in the <variables> element.
example:
note: (This example shows that all variables defined will continue to the next instance of the dataset, with the exception of var2. A local <var> continues attribute will override the "global" <variables> attribute. For most datasets this attribute is not needed.)
<variables continues="Y" >
<var name="var1" >
....
</var>
<var name="var2" continues="N" >
....
</var>
<var name="var3" >
....
</var>
</variables>
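The override rule can be stated in a few lines of code. The sketch below is an illustration of the rule as described here, not DataWeb's own implementation; the `effective_continues` helper is a hypothetical name. It resolves the effective continues value for each var: the local attribute wins, then the value on `<variables>`, then the documented default of "N".

```python
import xml.etree.ElementTree as ET


def effective_continues(xml_text):
    """Map each var name to its effective continues value."""
    root = ET.fromstring(xml_text)
    # The <variables> value acts as the default for every <var> beneath it.
    default = root.get("continues", "N")
    return {var.get("name"): var.get("continues", default)
            for var in root.findall("var")}


SAMPLE = """<variables continues="Y">
  <var name="var1"></var>
  <var name="var2" continues="N"></var>
  <var name="var3"></var>
</variables>"""

print(effective_continues(SAMPLE))
```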
______________________________________
– limit 60 characters, cannot contain quotation marks.
Short description of item that clearly and concisely identifies its content
example:
<label><![CDATA[Demographics – sex of person]]></label>
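The two label constraints are easy to check before the MIF is loaded. This is a minimal sketch: the `label_problems` helper is hypothetical, and it assumes "quotation marks" refers to the double-quote character, which this reference does not spell out.

```python
def label_problems(label):
    """Return the label rules that the given text violates."""
    problems = []
    if len(label) > 60:
        problems.append("label longer than 60 characters")
    # Assumption: only the double-quote character is treated as a quotation mark.
    if '"' in label:
        problems.append("label contains a quotation mark")
    return problems


print(label_problems("Demographics - sex of person"))
```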
______________________________________
(OPTIONAL) default: N/A
– limit 255 characters
Concept or topic label that the variable is grouped into. Even though this field is optional, it is strongly recommended to organize your variables into different topics.
example:
<concept><![CDATA[Demographic Variables]]></concept>
______________________________________
<security before="Jan 2006" />
NOT USED
Used to identify when a dataset should be released to the public.
Optional attributes
before - this item will be embargoed if the date is before the given date.
after - this item will be embargoed if the date is after the given date.
______________________________________
(OPTIONAL) default: public
Security Level (e.g. public or private) for a specific item. In most cases this field is not used.
______________________________________
(OPTIONAL) default: N/A
Unit type (e.g. dollars, minutes, percent, etc.)
______________________________________
Defines several attributes associated with each variable
attribute "type" use: future option
Item data type. Most people will define the type at the dataset level.
attribute "iterationgroup" use: optional
Iteration group size for longitudinal data (e.g. if a variable repeats 12 times, the group size is 12). If used, this attribute must contain an integer value.
This is only used for longitudinal data (see the longitudinal data category described above). It describes data that repeat for an individual over time. For example, the race of a person does not change, so there are no iterations associated with a race variable. Variables like income will change over time, so the income variable may have an iteration associated with it (e.g. Income1, Income2, Income3, ... IncomeN).
attribute "datatype" use: required
attribute "decimal" use: optional; must be an integer value.
If this variable is an integer value, the decimal value will be 0.
attribute "geographyIndicator" use: optional
Forces the geographic wizard, or indicates that a geographic selection is required. This will only be used if the geocodesetid attribute is defined.
attribute "geocodesetid" use: optional; must be an integer value.
If this is a standard geographic variable, no values need to be added; only the correct geographic codeset id needs to be identified. This is only a temporary attribute and needs to be modified.
This tells the system that the variable described is a geocode and can be associated with a specific geography (a map polygon, line, or point). Typically these geocodes are standard geographies, which may be used to match data from one dataset to another and can also be used for mapping.
attribute "interval" use: future option
attribute "isweight" use: required
Whether this item IS a weight.
attribute "logicalType" use: required
This is a place to create special variable classifications that analysts may want to use.
attribute "weightvar" use: optional
Names a suggested weight to be used by default in tabulations, or identifies a variable as a weight. To define a suggested weight, use the weight’s name. If there is NO suggested weight, omit this attribute.
______________________________________
(OPTIONAL) default: N/A
Universe description (must follow the long description).
This will be displayed along with the “Long Description” described above. The universe describes the type of people that may answer this question. It typically is determined by the answers the respondent gave to earlier questions. For example, the question may ask “how long have you been unemployed”. Only people who have previously said that they are unemployed, will be asked this question.
______________________________________
V: Values with descriptions (define all possible values). Label has a limit of 100 characters.
(e.g. 1=Male and 2=Female, or 0 to 99 years)
These are the labels for the values of microdata questions. Microdata is very difficult to use without these labels. The system uses them to label any tabulations, maps, or business graphics that are produced. Typically, a microdata variable can be traced to a question on a survey or poll questionnaire or, if the data come from administrative sources, to an element of a government form or business process.
example: V 1 Male
example: V 2 Female
or, for an item with a numeric range of valid values, define the minimum and maximum
example: V 0:99 Years
or, if the data contain a blank, the value should be defined exactly as Blank
example: V Blank Definition of blank value
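The three forms of value line (single code, min:max range, and Blank) can be told apart with a small parser. The sketch below assumes the fields are separated by whitespace, as in the examples; the `parse_value_line` helper and the dictionary keys it returns are illustrative choices, not part of the MIF format.

```python
def parse_value_line(line):
    """Split 'V <code> <label>' into a dict; a code like '0:99' is a range."""
    # Assumption: marker, code, and label are whitespace separated.
    marker, code, label = line.split(None, 2)
    if marker != "V":
        raise ValueError("not a value line")
    if len(label) > 100:
        raise ValueError("label longer than 100 characters")
    if ":" in code:
        low, high = code.split(":", 1)
        return {"min": low, "max": high, "label": label}
    return {"code": code, "label": label}


for text in ("V 1 Male", "V 0:99 Years", "V Blank Definition of blank value"):
    print(parse_value_line(text))
```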
______________________________________
:L: Long description of item.
This could include the full question text in a survey or administrative form, interviewer instructions for a survey, or recode/topcode definitions. The :L: delimiters are on their own lines, one above the description and one below.
example: :L:
Enter Appropriate Sex.
Ask Only If Necessary: What is your sex?
:L:
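Since the delimiters sit on their own lines, the long description can be recovered by collecting everything between the first pair of `:L:` lines. The following is a minimal sketch of that idea; the `long_description` helper is a hypothetical name, and stripping leading and trailing whitespace from each line is an assumption made for the example.

```python
def long_description(lines):
    """Return the text between the first pair of ':L:' delimiter lines."""
    inside = False
    collected = []
    for line in lines:
        if line.strip() == ":L:":
            if inside:
                # Second delimiter: the block is complete.
                return "\n".join(collected)
            inside = True
        elif inside:
            collected.append(line.strip())
    raise ValueError("no complete :L: block found")


block = [
    ":L:",
    "Enter Appropriate Sex.",
    "Ask Only If Necessary: What is your sex?",
    ":L:",
]
print(long_description(block))
```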
______________________________________
A: Attachment URL (e.g. Edit Specs, Recode Specs, Instrument Specs, etc.)
example: :A: Edit Specifications
http://www.bls.census.gov/specs/pesex.htm
______________________________________
B: Synonyms (Multiple words should either be listed separately, or comma delimited).
This allows metadata creators to put commonly used synonyms into the search facility.
example: B men
B boy
B gender
B women
B girl
or
example: B men, boy, gender, women, girl
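Both synonym layouts (one word per B line, or a comma-delimited list) reduce to the same list of words. The sketch below shows one way to normalize them; the `parse_synonyms` helper is hypothetical, and treating a line without the leading "B " as a continuation of the previous list is an assumption, not something this reference states.

```python
def parse_synonyms(lines):
    """Collect synonyms from 'B' lines, whether listed singly or comma delimited."""
    synonyms = []
    for line in lines:
        text = line.strip()
        if text.startswith("B "):
            text = text[2:]
        # Assumption: a line without the marker continues the previous list.
        for word in text.split(","):
            word = word.strip()
            if word:
                synonyms.append(word)
    return synonyms


print(parse_synonyms(["B men", "B boy", "B gender", "B women", "B girl"]))
print(parse_synonyms(["B men, boy, gender, women, girl"]))
```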
______________________________________
**********************************************************************************