The following describes the layout of the DataFerrett Metadata Interface File (MIF). Metadata is the information that defines a dataset and the variables, or items, found within that dataset. This includes the name of the Data Collection, the name of each dataset within that collection, the time period for the dataset, and the name, description, and values of each item in the dataset, as well as other information. The MIF is an ASCII file that is used to populate the DataFerrett metadata database. The DataFerrett metadata database contains all of the information passed to the users through the DataFerrett data access tool application.
Each piece of metadata is denoted by a 2-3 character delimiter (or token). Note that the delimiters should start in column one followed immediately by at least one space. The delimiters surrounded by colons allow for multiple line entries without having the delimiter at the beginning of each line. They should be at the beginning and ending of the text.
A MIF file is split up into two segments: Dataset Level Metadata and Item Level Metadata. We will look at each segment separately.
The Dataset Level Metadata segment defines the dataset and must come first. Before we define the tokens and information needed, we will discuss exactly what we mean by a “dataset.” In DataFerrett, there can be up to three levels of a “dataset” defined. The highest level is what we refer to as a Data Collection. A collection is comprised of one or more lower level datasets. For example, a survey such as the Current Population Survey (CPS) is comprised of a set of basic questions asked every month, and specialized “supplements” that are asked every so often. Therefore, the data collection is the Current Population Survey, and the basic monthly questions and each supplement are each considered a dataset within that collection. There are only two levels in this “dataset” example.
Some data collections may consist of three levels, basically an intermediate level between the “collection” and the “dataset.” You can consider it a “sub-collection” or “sub-grouping” within a data collection. For example, the Survey of Income and Program Participation (SIPP) is a data collection. SIPP is a longitudinal survey that is broken into “panels” and the data is released for each panel. But in each panel, there are “core” questions that are asked during each time period of the panel, and also “topical module” questions that are asked only at certain points in time. Therefore, in this case, the panels are an intermediate level between the collection, SIPP, and the datasets, Core and Topical Modules.
The last part of defining a dataset is the time period of the dataset. This is typically a year, or month and year, for the period for which the data was collected. The data collector, or provider, determines the time period for a dataset. When defining datasets in a MIF, you can only enter information for one time period in each MIF. However, DataFerrett has the ability to use data from different time periods for the same dataset, as long as the data items are defined the same way from one time period to the next. This will be discussed in more detail later.
Now we will briefly describe each Dataset Level Metadata MIF delimiter, or token, and then look at each of them in more detail. Each token for Dataset Level Metadata contains two or three letters beginning with an S. The following are valid Dataset Level tokens. The tag [Optional] at the end of the token description indicates that the token is not required. All other tokens are required. Also please note that a MIF can contain comments which are defined by a # at the beginning of the line.
VER Version number (version number of metadata publishing system, must be the very first line in MIF, currently 1.0)
SO Dataset operation (NEW or UPDATE)
SC Dataset Name (limit 255 characters)
SL Data Collection long name (limit 255 characters)
SS Data Collection short name (limit 12 characters)
SB Intermediate Level Name [Optional] (limit 255 characters)
ST Dataset time frame, startdate:stopdate (must have start date and stop date, e.g. Jan 2000:Jan 2000 or 2001:2001)
SD Dataset data category (1=microdata, 2=aggregate data)
SZ Display type in DataFerrett (1=Normal(time at lowest level), 2=Inverted(dataset at lowest level))
SA Tabulation machine address and port (either domain name or IP address, if no port given, 4505 will be assigned)
SX Extraction machine address and port (either domain name or IP address, if no port given, 4505 will be assigned)
SN Longitudinal data (YES=longitudinal) [Optional]
SI Inherited Dataset Name [Optional] (Must use an existing dataset’s name (SC token), within same Data Collection)
SU Logo image URL for the dataset [Optional] (fully qualified URL, e.g. http://www.name.com/image/datasetlogo.gif)
SSM Sponsor Name [Optional]
SSU Sponsor URL [Optional]
SSB Sponsor Banner URL [Optional]
SPM Provider Name [Optional]
SPU Provider URL [Optional]
SPB Provider Banner URL [Optional]
SDU Document URL [Optional]
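As an illustration of the token list above, the following Python sketch reads the dataset-level lines of a MIF into a dictionary and reports which required tokens are still missing. It is not DataFerrett's own ingestion code: the function names parse_dataset_tokens and missing_required are hypothetical, and the sketch assumes one token per line, a token that starts in column one and is followed by a space, # comment lines, and that the first unrecognized token ends the dataset-level segment.

```python
# Hypothetical helper names; a sketch, not DataFerrett's own ingestion code.
REQUIRED = ["VER", "SO", "SC", "SL", "SS", "ST", "SD", "SZ", "SA", "SX"]
OPTIONAL = ["SB", "SN", "SI", "SU", "SSM", "SSU", "SSB", "SPM", "SPU", "SPB", "SDU"]
KNOWN = set(REQUIRED) | set(OPTIONAL)


def parse_dataset_tokens(text):
    """Return {token: value} for the dataset-level lines of a MIF."""
    tokens = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and # comments
        token, _, value = line.partition(" ")
        if token not in KNOWN:
            break  # assumption: the first unknown token ends this segment
        tokens[token] = value.strip()
    return tokens


def missing_required(tokens):
    """List the required dataset-level tokens that have not been defined."""
    return [t for t in REQUIRED if t not in tokens]


example = """VER 1.0
# a partial MIF, used only for illustration
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
"""
tokens = parse_dataset_tokens(example)
print(tokens["SC"])              # Outpatient Department
print(missing_required(tokens))  # ['ST', 'SD', 'SZ', 'SA', 'SX']
```

A check like this is only a convenience before publishing; the metadata publishing system remains the authority on what it accepts.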
Dataset Level Tokens – A Detailed Look
Throughout this section we will describe each token, then add that token to an example MIF. After all tokens have been described, we will end up with a fully defined MIF at the dataset level. For our example we will define a health dataset, namely one part of the National Hospital Ambulatory Medical Care Survey (NHAMCS).
VER
Syntax – VER x.y
This defines which version of the metadata publishing system you are using to publish the dataset to DataFerrett. This MUST be the first line of every MIF. The version is important because the version you are using on your local machine must match the version in use on the centralized DataWeb site when you try to publish there. The version currently in use is 1.0.
Example MIF:
VER 1.0
SO
Syntax – SO OPERATION
Valid operations: NEW, UPDATE
This defines the operation that you are performing on the dataset you are defining. You use the NEW operation the very first time, and only the first time, you publish metadata for a specific dataset. Once a dataset has been published, you use the UPDATE operation, whether you are making dataset level metadata changes, or item level metadata changes. There is also an item level operation token (GO), which will be discussed in that section. Both tokens are necessary in every MIF.
In our example, we will assume that we are publishing this dataset’s metadata for the first time. Later we will discuss the UPDATE operation in detail.
Example MIF:
VER 1.0
SO NEW
SC
Syntax – SC Dataset Name
Size Limit – 255 characters
This defines the name of the dataset at the lowest level of the data collection. For our example, the National Hospital Ambulatory Medical Care Survey is comprised of two distinct sections, each with its own set of data. Therefore, we would need a MIF for each section, where the dataset name is different, but both share the same data collection name. In this example we will define one of the sections, the Outpatient Department dataset.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL
Syntax - SL Data Collection long name
Size Limit – 255 characters
This defines the full name of the data collection.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS
Syntax - SS Data Collection short name
Size Limit – 12 characters
This is an acronym or abbreviation for the Data Collection. It is used by the DataFerrett system to identify the dataset when tabulating or extracting the data.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
SB [Optional]
Syntax – SB Intermediate Level Name
Size Limit – 255 characters
This is an optional level of data for those datasets that have a level between the data collection and the dataset. Many datasets do not contain this middle level of definition. Our example does not need it, so we DO NOT include the token in our MIF.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST
Syntax – ST startdate:stopdate
This defines the dataset time frame, which must have a start date and a stop date separated by a colon, e.g. Jan 2000:Jan 2000 or 2001:2001. When a month is part of the time period, it must be defined using the 3 letter abbreviation with the first letter in upper case. Also, there must be a space between the month and the year. The very first time you publish information for a dataset, the start and stop dates MUST BE THE SAME. Then, for every new time period that you add, the start date will always stay the same and the stop date will be changed to the next time period. This will be discussed in more detail when the UPDATE operation is discussed. For our example, the first dataset that we are publishing is the NHAMCS data collected for the year 1996.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
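The ST rules above can be checked before publishing with a small Python sketch such as the one below. The function name parse_time_frame is hypothetical, and the sketch assumes four-digit years and English three-letter month abbreviations; it also treats the "start and stop must be the same" rule as applying whenever the dataset operation is NEW.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# A period is either "YYYY" or "Mon YYYY" (assumption: four-digit years).
PERIOD = re.compile(rf"(?:(?:{MONTHS}) )?\d{{4}}")


def parse_time_frame(value, operation="UPDATE"):
    """Split an ST value into (start, stop); raise ValueError if malformed."""
    start, sep, stop = value.partition(":")
    if not sep:
        raise ValueError("ST needs startdate:stopdate separated by a colon")
    for period in (start, stop):
        if not PERIOD.fullmatch(period):
            raise ValueError(f"bad time period: {period!r}")
    if operation == "NEW" and start != stop:
        raise ValueError("for a NEW dataset the start and stop dates must match")
    return start, stop


print(parse_time_frame("1996:1996", operation="NEW"))  # ('1996', '1996')
print(parse_time_frame("Jan 2000:Jun 2000"))           # ('Jan 2000', 'Jun 2000')
```

Values such as jan 2000:Jan 2000 (lower-case month) or Jan2000:Jan2000 (no space) are rejected by this sketch, matching the formatting rules stated for the token.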
SD
Syntax – SD data category #
Valid entries – 1 or 2 (1=microdata, 2=aggregate data)
There are basically two types of data that can be put into and accessed through DataFerrett: microdata and aggregate data. Microdata is data in which every record is at the unit of analysis level and all records must be added up to get the totals for each data item. For example, for surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization. Aggregate data is data which has already been summarized or added up, usually for specific geographical units or some other unit, such as industry classifications. In this case, each record is a geographical unit and there is no summing needed to get the totals for the geographies. In our example, the NHAMCS is a survey that collects the data from every outpatient hospital visit at the hospitals in the survey sample. Therefore, this dataset is a microdata dataset, where you must add up the responses to all the questions in order to get the totals. So, we will define this token as SD 1 (since 1=microdata).
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ
Syntax – SZ Display type #
Valid entries – 1 or 2 (1=Normal, 2=Inverted)
The display type refers to the way the dataset and time period information is displayed in the available datasets tree in DataFerrett. Display type = 1 refers to a normal tree view where the time period of the dataset is at the lowest level. Display type = 2 refers to an inverted tree view where the dataset name is at the lowest level, directly below the time period. See examples. There are reasons for doing it one way or the other. The normal view is most appropriate for a dataset that is collected regularly over time, either monthly, annually, biannually, etc., although the collections do not have to be at such regular intervals. The inverted view is appropriate for data collections that contain several datasets for a specific time period, but the datasets differ from one time period to the next. In our example, the NHAMCS is collected on a yearly basis, therefore it is most appropriate for it to use display type 1, the normal view. So, we will define this token as SZ 1.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ 1
SA
Syntax – SA Tabulation machine address:port number
The tabulation machine is the machine that contains the data that is used in DataFerrett tabulations. Enter either the domain name or IP address of the machine that contains the data, a colon, and then the port number. If no port number is given, port 4505 will be assigned. You will most likely need to check with the system administrator who set up TheDataWeb servers for this information. In our example, the machine that contains the data has a domain name of sippda.census.gov and is accessed through port 4505. Since it is port 4505, we could leave the port (and the colon) off.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ 1
SA sippda.census.gov:4505
SX
Syntax – SX Extraction machine address:port number
The extraction machine is the machine that contains the data that is used to download extracts out of DataFerrett. Enter either the domain name or IP address of the machine that contains the data, a colon, and then the port number. If no port number is given, port 4505 will be assigned. You will most likely need to check with your system administrator who set up TheDataWeb servers for this information. Usually the tabulation and extraction machines will be the same, but sometimes they are on different machines, or the same machine but are accessed through different ports. In our example, both machines are the same.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ 1
SA sippda.census.gov:4505
SX sippda.census.gov:4505
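The port default described for the SA and SX tokens can be sketched in Python as follows. The function name parse_machine_address is hypothetical, and the sketch assumes a plain host name or IPv4 address (an address that itself contains colons is not handled).

```python
DEFAULT_PORT = 4505  # assigned when the MIF gives no port


def parse_machine_address(value):
    """Split an SA or SX value into (host, port), defaulting the port to 4505."""
    host, sep, port = value.strip().partition(":")
    host = host.strip()
    if not host:
        raise ValueError("missing domain name or IP address")
    if not sep or not port.strip():
        return host, DEFAULT_PORT  # no port given
    return host, int(port)         # int() tolerates surrounding spaces


print(parse_machine_address("sippda.census.gov:4505"))  # ('sippda.census.gov', 4505)
print(parse_machine_address("sippda.census.gov"))       # ('sippda.census.gov', 4505)
```

Both calls return the same pair, which is why the example MIF could omit the port and the colon without changing its meaning.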
SI [Optional]
Syntax – SI Inherited Dataset Name
You must use an existing dataset’s exact name (SC token), within same Data Collection. This token is not commonly used. It is for data collections that have datasets that can be used along with one or more of the other datasets in that collection. For example, the basic monthly CPS data can be used with any of the CPS supplements. So, when defining the supplement you would define the SI token with the basic monthly dataset’s name, which is simply Basic. This is defined in the basic monthly MIF as the SC token. The SI token means that the supplement dataset “inherits” the basic monthly dataset’s items so users can tabulate or extract items from both datasets for a given time period. In our example, the NHAMCS outpatient department dataset DOES NOT inherit any other dataset. Therefore, we do NOT include the SI token.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ 1
SA sippda.census.gov:4505
SX sippda.census.gov:4505
SU [Optional]
Syntax – SU fully qualified URL
This token defines a URL that contains a dataset’s logo image. The URL must be fully defined, including the http:// at the beginning. The logo image gets used in the browser window that returns the links to the extracted files when using the extraction function in DataFerrett. This logo can provide DataFerrett users with dataset and dataset provider information when they extract data.
In our example, a logo image URL is defined.
Example MIF:
VER 1.0
SO NEW
SC Outpatient Department
SL National Hospital Ambulatory Medical Care Survey
SS NHAMCS
ST 1996:1996
SD 1
SZ 1
SA sippda.census.gov:4505
SX sippda.census.gov:4505
SU http://ferret.bls.census.gov/images/hamodban.jpg
We have now fully defined the Dataset Level Metadata for the NHAMCS Outpatient Department 1996 dataset. Next we will look at defining the Item Level Metadata for this same dataset, before going on to how to define updates.
The remaining segments define the items contained in a dataset and can come in any order throughout the rest of the MIF file. The possible segments are: Global, New, Update, Timeframe, or Stop.
The Global segment contains item level metadata tokens and should appear at the top of the MIF file after the Survey Level segment, although this is not required. A global value can be changed within the file by entering a new global value at the point at which the new value should begin. Also, global values are overridden by an individual value for any specific item. The following tokens are valid Global tokens, and their definitions can be found in the item-level metadata section below (GC is the global version of the C token, GW is the global version of the W token, etc.):
GC
GT
GW
GX
GY
GZ
GI
GE
GG
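The override rules for global values can be sketched in Python as below. This is only an illustration, not the ingestion system's logic: the function name resolve_items and the item names AGE, INCOME, and SEX are hypothetical, and the sketch assumes that each item begins with an M line (as the item-level section describes), that a global token maps to an item token by dropping the leading G, and that multi-line entries such as long descriptions can be ignored.

```python
GLOBAL_TOKENS = {"GC", "GT", "GW", "GX", "GY", "GZ", "GI", "GE", "GG"}
# Item tokens that have a global version (GC -> C, GW -> W, ...).
OVERRIDABLE = {token[1:] for token in GLOBAL_TOKENS}


def resolve_items(lines):
    """Return one dict per item, with the global values in force filled in."""
    defaults = {}   # current global values, keyed by item token
    items = []
    current = None  # item being defined, if any
    for line in lines:
        token, _, value = line.partition(" ")
        value = value.strip()
        if token in GLOBAL_TOKENS:
            defaults[token[1:]] = value  # new global value starts here
            current = None               # a global token closes the open item
        elif token == "M":
            current = {**defaults, "M": value}  # copy of the current globals
            items.append(current)
        elif current is not None and token in OVERRIDABLE:
            current[token] = value       # individual value overrides the global


    return items


lines = [
    "GW BASEWGT",
    "GX Public",
    "M AGE",
    "M INCOME",
    "X Sponsor",   # overrides GX for INCOME only
    "GW NONE",     # changes the global from this point on
    "M SEX",
]
for item in resolve_items(lines):
    print(item["M"], item["W"], item["X"])
# AGE BASEWGT Public
# INCOME BASEWGT Sponsor
# SEX NONE Public
```

Copying the globals at each M line is what keeps a later change to a global value from affecting items that were already defined.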
The remaining four segments describe the operation to be performed on item-level metadata contained within the segment. Each segment is preceded by the GO global that indicates the Global Operation the segment should perform. The following operations are valid values for the GO token:
Item-level metadata appears between operation tokens (GO) and must begin with the M token to indicate a new item-level variable is beginning. The definition of a variable is complete when the ingestion system finds another M token, any global token, or the end-of-file, whichever comes first. The tag [Optional] at the end of the token description indicates optional tokens.
M Item (variable) name or mnemonic
S Short description or English label (limit 60 characters, cannot contain quotation marks)
C Concept or topic label
T Time of item: when it began, e.g. Jan 1994 for January 1994, and when it ended (if it has), e.g. Jan 1994:Jun 1994. If it continues into the future, there is no stop date.
W Suggested weight variable name (e.g. BASEWGT), OR Yes (if item IS a weight), OR NONE (if there is no suggested weight) (see footnote 1)
X Security level, use entire word as follows:
    Public
    Sponsor
Y Variable type abbreviation as follows:
    E = Edited
    U = Unedited
    W = Weighting
    R = Recode
    X = Allocation flag
    T = Topcoded
    S = Sample Control
    G = Geography
    P = Replicate Weights
    N = Public Use
Z Data type abbreviation as follows:
    B = Binary (numeric)
    Cx = Character (user defines x to be the length of field from 1 to 255)
    T = Military time (HH:MM)
    Ix.y = Implied decimal (user defines x to be the total length of the value including the decimal, and y is the number of digits to the right of the decimal)
        For example:
        I10.4 = Implied decimal (5 digits to the left and 4 digits to the right of the decimal)
        I5.2 = Implied decimal (2 digits to the left and 2 to the right)
        Note: the value line should then contain the minimum and maximum value with a decimal, e.g. for Z I5.2, the value line should be V 0.00:99.99
    Fx.y = Floating point with precision as described for Implied decimals
N Unit type abbreviation as follows:
    ABS = Absolute number
    AVG = Average
    DOL = Dollars
    MIN = Minutes
    PCT = Percent
    SQM = Square miles
    TH$ = Thousands of dollars
    RTE = Rate
G Geography indicator
    0 = Not a geography item
    1 = Geography item, but selection not required
    2 = Geography item, selection required
The following item is optional, but strongly recommended for items that are not allocation flags or topcoded items:
V Value with description (limit 100 characters). Each value line should have a V at the beginning, but DO NOT put a V at the beginning of a line if the description wraps to the next line.
    V 1 Male
    V 2 Female
  or for a continuous range variable, the minimum and maximum values MUST be separated by a colon:
    V -1 Blank
    V 0:99 Years
  or for a continuous range with decimals (e.g. Z I10.4):
    V 0.0000:99999.9999
  IF the data contains a blank, the value should be defined exactly as:
    V Blank
:L:
Long description. The description may span multiple lines. There must be a :L: on the lines before and after the description.
:L:
The following items are optional:
P CD-ROM or ASCII file data start and end positions, e.g. P 15 16
U Universe description (Universe descriptions MUST follow the Long description for an item.)
:A: Attachment type (e.g. Edit Specs, Recode Specs, Instrument Specs, Sampling, User Note, etc.) followed by the URL of the text, beginning on the next line, e.g. http://www.census.gov/mydir/myfile.htm (Please note: there is no :A: line after the URL line.)
B Synonyms (multiple words should either be listed separately or be comma delimited), e.g.
    B men
    B boy
    B gender
    B women
    B girl
  or
    B men, boy, gender, women, girl
I Iteration group size for longitudinal data (i.e. if a variable repeats 12 times, then 12 would be the group size).
_____________________________
1 When entering new variables, please place variables used as Suggested Weight variables with their corresponding information at the top of the file.
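The relationship between an implied-decimal Z type and its V line, described under the Z token above, can be worked out mechanically. The Python sketch below uses a hypothetical function name, widest_value_line, and assumes non-negative values, so the minimum is always zero; it produces the widest range a given Ix.y type can hold, which matches the two examples given in the token list.

```python
import re


def widest_value_line(z_value):
    """For a Z value such as 'I5.2', build the widest matching V line."""
    match = re.fullmatch(r"I(\d+)\.(\d+)", z_value)
    if not match:
        raise ValueError(f"not an implied decimal type: {z_value!r}")
    total, right = int(match.group(1)), int(match.group(2))
    left = total - right - 1  # the total length includes the decimal point
    if left < 1 or right < 1:
        raise ValueError(f"no room for digits on both sides: {z_value!r}")
    minimum = "0." + "0" * right              # assumption: values are not negative
    maximum = "9" * left + "." + "9" * right
    return f"V {minimum}:{maximum}"


print(widest_value_line("I5.2"))   # V 0.00:99.99
print(widest_value_line("I10.4"))  # V 0.0000:99999.9999
```

An actual item may need a narrower range than the widest one; the sketch only shows how the total length and the number of decimal digits determine the shape of the minimum and maximum.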
Email: dsd_ferrett@census.gov