
DataFerrett: Browser for TheDataWeb

A collaboration between the U.S. Census Bureau and the Centers for Disease Control     
 TheDataWeb HelpDesk:
 Toll Free: 866-437-0171

 DataFerrettTeam Email:
 dsd_ferrett@census.gov


Metadata Interface File Documentation

The following describes the layout of the DataFerrett Metadata Interface File (MIF). Metadata is the information that defines a dataset and the variables, or items, found within that dataset.  This includes the name of the Data Collection, the name of each dataset within that collection, the time period for the dataset, and the name, description, and values of each item in the dataset, as well as other information. The MIF is an ASCII file that is used to populate the DataFerrett metadata database.  The DataFerrett metadata database contains all of the information passed to the users through the DataFerrett data access tool application.

Each piece of metadata is denoted by a short delimiter (or token) of one to three characters. Each delimiter must start in column one and be followed immediately by at least one space. Delimiters surrounded by colons (for example, :L:) allow multiple-line entries without repeating the delimiter at the beginning of each line. They should appear at the beginning and end of the text.

A MIF file is split up into two segments: Dataset Level Metadata and Item Level Metadata.  We will look at each segment separately.
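The line rules above (token in column one, at least one space, # comments, and colon-delimited multi-line entries) can be sketched as a small reader. This is a minimal illustration, not the DataFerrett ingestion code: the function name read_mif is hypothetical, tokens are assumed to be one to three upper-case letters, and only paired markers such as :L: are handled. Lines that do not match any rule (for example, wrapped continuation lines) are skipped.

```python
import re

# Assumption: a token is 1-3 upper-case letters in column one, then at least one space.
TOKEN_LINE = re.compile(r"^([A-Z]{1,3})\s+(.*\S)\s*$")
# A colon-delimited marker alone on a line, e.g. ":L:".
BLOCK_MARK = re.compile(r"^:([A-Z]):\s*$")


def read_mif(text):
    """Yield (token, value) pairs from MIF text (hypothetical helper)."""
    block_token = None   # set while inside a :X: ... :X: block
    block_lines = []
    for raw in text.splitlines():
        line = raw.rstrip()
        mark = BLOCK_MARK.match(line)
        if block_token is not None:
            if mark and mark.group(1) == block_token:
                # Closing marker: emit the collected multi-line text.
                yield block_token, "\n".join(block_lines)
                block_token, block_lines = None, []
            else:
                block_lines.append(line)
            continue
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if mark:
            block_token = mark.group(1)   # opening marker
            continue
        match = TOKEN_LINE.match(line)
        if match:
            yield match.group(1), match.group(2)
```

Calling list(read_mif(text)) gives the tokens in file order, which is convenient because the order of lines in a MIF is significant.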

Dataset Level Metadata

The Dataset Level Metadata segment defines the dataset and must come first. Before we define the tokens and information needed, we will discuss exactly what we mean by a “dataset.”  In DataFerrett, there can be up to three levels of a “dataset” defined.  The highest level is what we refer to as a Data Collection.  A collection is comprised of one or more lower level datasets.  For example, a survey such as the Current Population Survey (CPS) is comprised of a set of basic questions asked every month, and specialized “supplements” that are asked every so often.  Therefore, the data collection is the Current Population Survey, and the basic monthly questions and each supplement are each considered a dataset within that collection.  There are only two levels in this “dataset” example.

Some data collections may consist of three levels, with an intermediate level between the “collection” and the “dataset.”  You can consider it a “sub-collection” or “sub-grouping” within a data collection.  For example, the Survey of Income and Program Participation (SIPP) is a data collection.  SIPP is a longitudinal survey that is broken into “panels” and the data is released for each panel.  Within each panel, however, there are “core” questions that are asked during each time period of the panel, and also “topical module” questions that are asked only at certain points in time.  Therefore, in this case, the panels are an intermediate level between the collection, SIPP, and the datasets, Core and Topical Modules.

The last part of defining a dataset is the time period of the dataset.  This is typically a year, or month and year, for the period for which the data was collected.  The data collector, or provider, determines the time period for a dataset.  When defining datasets in a MIF, you can only enter information for one time period in each MIF.  However, DataFerrett has the ability to use data from different time periods for the same dataset, as long as the data items are defined the same way from one time period to the next.  This will be discussed in more detail later.

Now we will briefly describe each Dataset Level Metadata MIF delimiter, or token, and then look at each of them in more detail.  Apart from VER, each Dataset Level Metadata token consists of two or three letters beginning with an S. The following are valid Dataset Level tokens. The tag [Optional] in a token description indicates that the token is not required; all other tokens are required.  Also note that a MIF can contain comments, which are marked by a # at the beginning of the line.

VER   Version number (version number of the metadata
      publishing system; must be the very
      first line in the MIF; currently 1.0)

SO    Dataset operation (NEW or UPDATE)
SC    Dataset name (limit 255 characters)
SL    Data Collection long name (limit 255 characters)
SS    Data Collection short name (limit 12 characters)
SB    Intermediate level name [Optional]
      (limit 255 characters)
ST    Dataset time frame, startdate:stopdate
      (must have a start date and a stop date,
      e.g. Jan 2000:Jan 2000 or 2001:2001)
SD    Dataset data category
      (1=microdata, 2=aggregate data)
SZ    Display type in DataFerrett
      (1=Normal (time at lowest level),
      2=Inverted (dataset at lowest level))
SA    Tabulation machine address and port
      (either domain name or IP address;
      if no port is given, 4505 will be assigned)
SX    Extraction machine address and port
      (either domain name or IP address;
      if no port is given, 4505 will be assigned)
SN    Longitudinal data (YES=longitudinal) [Optional]
SI    Inherited dataset name [Optional]
      (must use an existing dataset’s name
      (SC token) within the same Data Collection)
SU    Logo image URL for the dataset [Optional]
      (fully qualified URL,
      e.g. http://www.name.com/image/datasetlogo.gif)
SSM   Sponsor Name [Optional]
SSU   Sponsor URL [Optional]
SSB   Sponsor Banner URL [Optional]
SPM   Provider Name [Optional]
SPU   Provider URL [Optional]
SPB   Provider Banner URL [Optional]
SDU   Document URL [Optional]
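
The requirements in the list above (VER first, required versus optional tokens, and length limits) lend themselves to a simple pre-publication check. The sketch below is illustrative only: check_dataset_header is a hypothetical helper, and it assumes the MIF has already been split into (token, value) pairs in file order.

```python
# Tokens without the [Optional] tag in the list above.
REQUIRED = ["VER", "SO", "SC", "SL", "SS", "ST", "SD", "SZ", "SA", "SX"]
# Character limits stated in the list above.
LIMITS = {"SC": 255, "SL": 255, "SS": 12, "SB": 255}


def check_dataset_header(pairs):
    """Return a list of problems found in dataset-level (token, value) pairs."""
    problems = []
    if not pairs or pairs[0][0] != "VER":
        problems.append("VER must be the first line")
    seen = dict(pairs)
    for token in REQUIRED:
        if token not in seen:
            problems.append(f"missing required token {token}")
    for token, limit in LIMITS.items():
        if token in seen and len(seen[token]) > limit:
            problems.append(f"{token} exceeds {limit} characters")
    if seen.get("SO") not in (None, "NEW", "UPDATE"):
        problems.append("SO must be NEW or UPDATE")
    for token in ("SD", "SZ"):
        if seen.get(token) not in (None, "1", "2"):
            problems.append(f"{token} must be 1 or 2")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the publishing system will accept the file.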
 

Dataset Level Tokens – A Detailed Look

Throughout this section we will describe each token, then add that token to an example MIF.  After all tokens have been described, we will end up with a fully defined MIF at the dataset level.  For our example we will define a health dataset, namely one part of the National Hospital Ambulatory Medical Care Survey (NHAMCS).

VER

Syntax – VER  x.y

This defines the version of the metadata publishing system that you are using to publish the dataset to DataFerrett.  This MUST be the first line of every MIF.  The version is important because the version on your local machine must match the version in use on the centralized DataWeb site when you publish there.  Version 1.0 is currently in use.

Example MIF:

VER 1.0

SO

Syntax – SO  OPERATION

Valid operations:  NEW,  UPDATE

This defines the operation that you are performing on the dataset you are defining.  You use the NEW operation the very first time, and only the first time, you publish metadata for a specific dataset.  Once a dataset has been published, you use the UPDATE operation, whether you are making dataset level metadata changes, or item level metadata changes.  There is also an item level operation token (GO), which will be discussed in that section.  Both tokens are necessary in every MIF.

In our example, we will assume that we are publishing this dataset’s metadata for the first time.  Later we will discuss the UPDATE operation in detail.

Example MIF:

VER 1.0
SO  NEW

 

 

SC

Syntax – SC  Dataset Name

Size Limit – 255 characters

This defines the name of the dataset at the lowest level of the data collection.  For our example, the National Hospital Ambulatory Medical Care Survey is comprised of two distinct sections, each with its own set of data.  Therefore, we would need a MIF for each section, where the dataset name is different, but both share the same data collection name.  In this example we will define one of the sections, the Outpatient Department dataset.

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department

           

SL

Syntax – SL  Data Collection long name

Size Limit – 255 characters

This defines the full name of the data collection.  

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey

 

SS

Syntax – SS  Data Collection short name

Size Limit – 12 characters

This is an acronym or abbreviation for the Data Collection.  It is used 
by the DataFerrett system to identify the dataset when tabulating or 
extracting the data.

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS

 
 
SB  [Optional]
 
Syntax – SB Intermediate Level Name

Size Limit – 255 characters

This is an optional level for those datasets that have a level
between the data collection and the dataset.  Many datasets do not 
contain this middle level of definition.  Our example does not need it, 
so we DO NOT include the token in our MIF.
 

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS

 
 
ST 
 
Syntax – ST  startdate:stopdate
 
This defines the dataset time frame, which must have start date and 
stop date separated by a colon, e.g. Jan 2000:Jan 2000 or 2001:2001.  
When a month is part of the time period, it must be defined using the 
3 letter abbreviation with the first letter in upper case. Also, there 
must be a space between the month and the year. 
 
The very first time you publish information for a dataset, the start 
and stop dates MUST BE THE SAME.  Then, for every new time 
period that you add, the start date will always stay the same and 
the stop date will be changed to the next time period.  This will be 
discussed in more detail when the UPDATE operation is discussed. 
  
For our example, the first dataset that we are publishing is the 
NHAMCS data collected for the year 1996.
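
The ST rules above can be expressed as a short validation sketch. It is an illustration under stated assumptions: parse_st is a hypothetical name, years are assumed to be four digits, and the "start equals stop" rule is applied only when the dataset operation is NEW.

```python
import re

# Three-letter month abbreviations with the first letter in upper case.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Either "Mon YYYY" (single space) or "YYYY"; the four-digit year is an assumption.
PERIOD = re.compile(rf"^(?:({MONTHS}) )?(\d{{4}})$")


def parse_st(value, operation="NEW"):
    """Split an ST value into (start, stop) and check the rules described above."""
    start, sep, stop = value.partition(":")
    if not sep:
        raise ValueError("ST needs startdate:stopdate separated by a colon")
    for part in (start, stop):
        if not PERIOD.match(part):
            raise ValueError(f"bad time period: {part!r}")
    if operation == "NEW" and start != stop:
        raise ValueError("start and stop dates must be the same on first publication")
    return start, stop
```

For the NHAMCS example, parse_st("1996:1996") passes, while a lower-case month such as "jan 2000:jan 2000" is rejected.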
 

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996

 
 
SD 
 
Syntax – SD data category #  
 
Valid entries – 1  or 2   (1=microdata, 2=aggregate data)
 
There are basically two types of data that can be put into and
accessed through DataFerrett, microdata and aggregate data.  
Microdata is data in which every record is at the unit of analysis 
level and all records must be added up to get the totals for each 
data item.  For example, for surveys of individuals, microdata 
contain records for each individual interviewed; for surveys of 
organizations, the microdata contain records for each organization.
  
Aggregate data is data which has already been summarized or 
added up, usually for specific geographical units or some other unit, 
such as industry classifications.  In this case, each record is a 
geographical unit and there is no summing needed to get the totals 
for the geographies.
 
In our example, the NHAMCS is a survey that collects the data from 
every outpatient hospital visit at the hospitals in the survey sample.  
Therefore, this dataset is a microdata dataset, where you must add up 
the responses to all the questions in order to get the totals.  So, 
we will define this token as  SD  1 (since 1=microdata).
 

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1

 
 
SZ 
 
Syntax – SZ  Display type # 
 
Valid entries – 1  or 2   (1=Normal, 2=Inverted)
 
The display type refers to the way the dataset and time period 
information is displayed in the available datasets tree in DataFerrett. 
Display type = 1 refers to a normal tree view where the time period 
of the dataset is at the lowest level. Display type = 2 refers to an 
inverted tree view where the dataset name is at the lowest level, 
directly below the time period.  See examples.
 
There are reasons for doing it one way or the other.  The normal 
view is most appropriate for a dataset that is collected  regularly over 
time, either monthly, annually, biannually, etc., although they do not 
have to be at such regular intervals.  
 
The inverted view is appropriate for data collections that contain 
several datasets for a specific time period, but the datasets differ 
from one time period to the next.
 
In our example, the NHAMCS is collected on a yearly basis, therefore 
it is most appropriate for it to use display type 1, the normal view.  SZ  1.
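
To make the two display types concrete, the sketch below returns the order of tree nodes from top to bottom for a single dataset. It is only an illustration: tree_path is a hypothetical function, it assumes the data collection sits at the top of the tree, and it ignores the optional intermediate (SB) level.

```python
def tree_path(collection, dataset, period, display_type):
    """Return the node labels from the top of the datasets tree to the bottom."""
    if display_type == 1:
        # Normal view: the time period is at the lowest level.
        return [collection, dataset, period]
    if display_type == 2:
        # Inverted view: the dataset name is directly below the time period.
        return [collection, period, dataset]
    raise ValueError("SZ must be 1 or 2")
```

With SZ 1, the NHAMCS example would appear as NHAMCS, then Outpatient Department, then 1996.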
 

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1
SZ  1

 
 
SA 
 
Syntax – SA  Tabulation machine address: port number
 
The tabulation machine is the machine that contains the data that is 
used in DataFerrett tabulations.  Enter either the domain name or 
IP address of the machine that contains the data, a colon, and then 
the port number.  If no port number is given, port 4505 will be 
assigned.  You will most likely need to check with your system 
administrator who set up TheDataWeb servers for this information. 
 
In our example, the machine that contains the data has a domain name 
of sippda.census.gov and is accessed through port 4505.  Since it 
is port 4505, we could leave the port off (and the colon).

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1
SZ  1
SA  sippda.census.gov:4505

 
 
SX 
 
Syntax – SX  Extraction machine address : port 
 
The extraction machine is the machine that contains the data that is 
used to download extracts out of DataFerrett.  Enter either the domain 
name or IP address of the machine that contains the data, a colon, 
and then the port number.  If no port number is given, port 4505 will 
be assigned.  You will most likely need to check with your system 
administrator who set up TheDataWeb servers for this information.  
Usually the tabulation and extraction machines will be the same, 
but sometimes they are on different machines, or the same machine 
but are accessed through different ports.
 
In our example, both machines are the same.
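
The address and port rule shared by SA and SX (a port after a colon, defaulting to 4505 when omitted) can be sketched as follows. The helper name parse_machine is hypothetical, and the sketch assumes a domain name or IPv4 address, since an IPv6 address would itself contain colons.

```python
DEFAULT_PORT = 4505  # assigned when no port is given, per the SA and SX descriptions


def parse_machine(value):
    """Split an SA or SX value into (host, port), applying the default port."""
    host, sep, port = value.strip().partition(":")
    host = host.strip()
    if not host:
        raise ValueError("machine address is empty")
    if sep and port.strip():
        return host, int(port)   # int() tolerates surrounding spaces
    return host, DEFAULT_PORT
```

Both "sippda.census.gov:4505" and "sippda.census.gov" resolve to the same host and port, which matches the remark that the port and colon may be left off.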

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1
SZ  1
SA  sippda.census.gov:4505
SX  sippda.census.gov:4505

 
 
SI  [Optional]
 
Syntax – SI  Inherited Dataset Name  
 
You must use an existing dataset’s exact name (SC token) within the 
same Data Collection.  This token is not commonly used.  It is for 
data collections that have datasets that can be used along with one 
or more of the other datasets in that collection.  For example, the 
basic monthly CPS data can be used with any of the CPS 
supplements. So, when defining the supplement you would define the 
SI token with the basic monthly dataset’s name, which is simply Basic. 
This is defined in the basic monthly MIF as the SC token.  The SI 
token means that the supplement dataset "inherits" the basic monthly
dataset’s items so users can tabulate or extract items from both 
datasets for a given time period.
 
In our example, the NHAMCS outpatient department dataset DOES 
NOT inherit any other dataset.  Therefore, we do NOT include the 
SI token.
 

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1
SZ  1
SA  sippda.census.gov:4505
SX  sippda.census.gov:4505

 
 

SU [Optional]

Syntax – SU  fully qualified URL

This token defines a URL that contains a dataset’s logo image. The URL must be fully defined, including the http:// at the beginning. The logo image gets used in the browser window that returns the links to the extracted files when using the extraction function in DataFerrett.  This logo can provide DataFerrett users with dataset and dataset provider information when they extract data.

In our example, a logo image URL is defined.

Example MIF:

VER 1.0
SO  NEW
SC  Outpatient Department
SL  National Hospital Ambulatory Medical Care Survey
SS  NHAMCS
ST  1996:1996
SD  1
SZ  1
SA  sippda.census.gov:4505
SX  sippda.census.gov:4505
SU  http://ferret.bls.census.gov/images/hamodban.jpg

 

We have now fully defined the Dataset Level Metadata for the NHAMCS Outpatient Department 1996 dataset.  Next we will look at defining the Item Level Metadata for this same dataset, before going on to how to define updates.
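
If the dataset-level values are held in a program rather than typed by hand, the finished header can be generated. The sketch below is one possible approach, not part of the DataFerrett tools: render_header is a hypothetical function, and the token order simply follows the order in which this document lists the tokens (only VER being first is stated as a requirement).

```python
# Assumed output order: the order in which the tokens are listed in this document.
ORDER = ["SO", "SC", "SL", "SS", "SB", "ST", "SD", "SZ", "SA", "SX", "SN", "SI", "SU"]


def render_header(fields, version="1.0"):
    """Build dataset-level MIF text; every token starts in column one."""
    lines = [f"VER {version}"]        # VER must be the very first line
    for token in ORDER:
        if token in fields:           # optional tokens are simply omitted
            lines.append(f"{token}  {fields[token]}")
    return "\n".join(lines) + "\n"


nhamcs_1996 = {
    "SO": "NEW",
    "SC": "Outpatient Department",
    "SL": "National Hospital Ambulatory Medical Care Survey",
    "SS": "NHAMCS",
    "ST": "1996:1996",
    "SD": "1",
    "SZ": "1",
    "SA": "sippda.census.gov:4505",
    "SX": "sippda.census.gov:4505",
    "SU": "http://ferret.bls.census.gov/images/hamodban.jpg",
}
```

Calling render_header(nhamcs_1996) reproduces the example MIF built up in this section.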

 

 

Item Level Metadata

The remaining segments define the items contained in a dataset and can come in any order throughout the rest of the MIF file. The possible segments are: Global, New, Update, Timeframe, or Stop.

The Global segment contains item level metadata tokens and should appear at the top of the MIF file after the Dataset Level Metadata segment, although this is not required. A global value can be changed within the file by entering a new global value at the point at which the new value should begin. Also, global values are overridden by an individual value for any specific item. The following tokens are valid Global tokens and the definitions can be found in the item-level metadata section below (GC is the global version of the C token, GW is the global version of the W token, etc.):

GC 
GT
GW
GX
GY
GZ
GI
GE
GG

The remaining four segments describe the operation to be performed on item-level metadata contained within the segment. Each segment is preceded by the GO global that indicates the Global Operation the segment should perform. The following operations are valid values for the GO token:

  • NEW
    • The variable is inserted into the repository. An error occurs if a variable with the same mnemonic and timeframe exists.
  • UPDATE
    • Updates everything except the timeframe for an existing variable. If the timeframe (T) token is specified, it is ignored by the UPDATE operation. An error occurs if the variable does not exist.
  • TIMEFRAME
    • Performs the same function as UPDATE but modifies the timeframe as well. Do not use this operation unless you need to modify a variable's timeframe (T). The end timeframe (E) token is required in addition to item-level metadata as specified below.
  • STOP
    • Stops a variable. Only the mnemonic (M) and end timeframe (E) tokens are required.

Item-level metadata appears between operation tokens (GO) and must begin with the M token to indicate a new item-level variable is beginning. The definition of a variable is complete when the ingestion system finds another M token, any global token, or the end-of-file, whichever comes first. The tag [Optional] at the end of the token description indicates optional tokens.
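
The segment and item boundaries just described can be sketched as a grouping step over (token, value) pairs. The code is an illustration under assumptions: group_by_operation and check_item are hypothetical names, any two-letter token starting with G is treated as a global token, and the E value used in examples is invented because its exact format is not given here.

```python
VALID_OPS = {"NEW", "UPDATE", "TIMEFRAME", "STOP"}


def group_by_operation(pairs):
    """Return a list of (operation, item_dict) in file order."""
    operation = None
    current = None
    result = []
    for token, value in pairs:
        if token == "GO":
            if value not in VALID_OPS:
                raise ValueError(f"unknown operation: {value}")
            operation, current = value, None
        elif len(token) == 2 and token.startswith("G"):
            current = None                 # any global token completes the item
        elif token == "M":
            if operation is None:
                raise ValueError("item found before any GO token")
            current = {"M": value}         # M starts a new item
            result.append((operation, current))
        elif current is not None:
            current[token] = value
    return result


def check_item(operation, item):
    """Flag a missing E token where the operation requires one."""
    if operation in ("STOP", "TIMEFRAME") and "E" not in item:
        return [f"{operation} requires the E token"]
    return []
```

Note that the single-letter G token (geography indicator) is an ordinary item-level token and does not end an item in this sketch.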

M  Item (variable) name or mnemonic.
S  Short description or English label 
   (limit 60 characters, cannot contain 
   quotation marks)
C  Concept or topic label
T  Time of item (when it began),
   e.g. Jan 1994 for January 1994, 
   and when it ended (if it has).
   If it continues into the future, 
   there is no stopdate,
   e.g. Jan 1994:Jun 1994
W  Suggested weight variable name [1] 
   (e.g. BASEWGT), OR
   Yes (if item IS a weight), OR
   NONE (if there is no suggested weight)
X  Security Level, use entire word as follows:
   Public
   Sponsor
Y  Variable type abbreviation as follows:
   E = Edited
   U = Unedited
   W = Weighting
   R = Recode
   X = Allocation flag
   T = Topcoded
   S = Sample Control
   G = Geography
   P = Replicate Weights
   N = Public Use
Z  Data type abbreviation as follows:
   B = Binary (numeric)
   Cx = Character (user defines x to be 
        the length of field from 1 to 255)
   T = Military time (HH:MM)
   Ix.y = Implied decimal 
          (user defines x to be the total 
          length of the value including the 
          decimal point, and y to be the number 
          of digits to the right of the decimal)
   For example:
   I10.4 = Implied decimal 
           (5 digits to the left and 4 
           digits to the right of the decimal)
   I5.2 = Implied decimal 
          (2 digits to the left and 
          2 to the right)
      Note: the value line should then 
      contain the minimum and maximum values 
      with a decimal, e.g. for Z I5.2, 
      the value line should be 
          V 0.00:99.99
   Fx.y = Floating point with precision 
          as described for Implied decimals
N  Unit type abbreviation as follows: 
   ABS = Absolute number
   AVG = Average
   DOL = Dollars
   MIN = Minutes
   PCT = Percent
   SQM = Square miles
   TH$ = Thousands of dollars
   RTE = Rate
G  Geography indicator
   0 = Not a geography item
   1 = Geography item, but selection not required
   2 = Geography item, selection required
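
The arithmetic behind the implied decimal data type (Ix.y) is easy to get wrong, because x counts the decimal point as one position. The sketch below derives the widest possible value range for a given Z entry. The function name is hypothetical, and the sketch assumes non-negative values with a minimum of zero, as in the examples above.

```python
import re


def implied_decimal_range(z_value):
    """Return 'min:max' for an Ix.y data type, e.g. 'I5.2' gives '0.00:99.99'."""
    match = re.fullmatch(r"I(\d+)\.(\d+)", z_value)
    if not match:
        raise ValueError(f"not an implied decimal type: {z_value!r}")
    total, right = int(match.group(1)), int(match.group(2))
    left = total - right - 1      # one position is taken by the decimal point
    if left < 1 or right < 1:
        raise ValueError("need at least one digit on each side of the decimal")
    low = "0." + "0" * right
    high = "9" * left + "." + "9" * right
    return f"{low}:{high}"
```

The result can be placed after a V token; an actual item may of course have a narrower range than the widest one the field can hold.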
 
The following item is optional, but strongly 
recommended for items that 
are not allocation flags or 
topcoded items:
 
V  Value with description. 
   (limit 100 characters) Each value  
   line should have a V at the beginning, 
   but DO NOT put a V at the beginning 
   of a line if the description wraps to 
   the next line.
   V  1 Male
   V  2 Female
   or for a continuous range variable,  
   the minimum and maximum values MUST be 
   separated by a colon:
   V  -1 Blank
   V  0:99 Years
   or for a continuous range with  
   decimals (e.g. Z I10.4):
   V  0.0000:99999.9999
   IF the data contains a blank, 
   the value should be defined exactly as:
   V  Blank
:L:
Long description. There may be a multiple 
line description. There must be a :L: on 
the lines before and after the description.
:L:
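
The three forms of a V line shown above (a coded value, a colon-separated range, and the literal Blank) can be told apart with a few string operations. This is a minimal sketch with a hypothetical function name; it takes the text that follows the V token and keeps minimum and maximum as strings so that decimal places are preserved.

```python
def parse_value(value):
    """Classify the text after a V token as a blank, a range, or a coded value."""
    code, _, label = value.strip().partition(" ")
    label = label.strip()
    if code == "Blank":
        return {"kind": "blank"}
    if ":" in code:
        low, high = code.split(":", 1)   # minimum and maximum separated by a colon
        return {"kind": "range", "min": low, "max": high, "label": label}
    return {"kind": "code", "code": code, "label": label}
```

For example, "0:99 Years" is read as a range from 0 to 99 labeled Years, while "-1 Blank" is an ordinary coded value whose label happens to be Blank.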
 
The following items are optional:
P  CD-ROM or ascii file data start  
   and end positions, e.g. P 15 16
U  Universe description 
   (Universe descriptions MUST follow 
   Long description for an item.)
:A:  Attachment type 
     (e.g. Edit Specs, Recode Specs, 
     Instrument Specs, Sampling, User 
     Note, etc.) followed by the URL of 
     the text, beginning on the next 
     line, e.g. 
     http://www.census.gov/mydir/myfile.htm 
     (Please note: there is no :A: line 
     after the URL line.)
B  Synonyms (multiple words should either 
   be listed separately or be comma 
   delimited).
   (e.g. B men
         B boy
         B gender
         B women
         B girl
   or
         B men, boy, gender, women, girl)
I  Iteration group size for longitudinal 
   data (i.e. variable repeats 12 times, 
   then 12 would be the group size).
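
Because synonyms (B token) may be given either one per line or comma delimited, a reader of the file has to accept both forms. The sketch below, with a hypothetical function name, merges the values of all B lines for one item into a single list without duplicates.

```python
def collect_synonyms(b_values):
    """Merge B values (one word per line or comma delimited) into one list."""
    synonyms = []
    for value in b_values:
        for word in value.split(","):
            word = word.strip()
            if word and word not in synonyms:
                synonyms.append(word)
    return synonyms
```

Both styles shown in the B description above produce the same list of five words.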
 
_____________________________

[1] When entering new variables, please place variables used as Suggested Weight variables with their corresponding information at the top of the file.


Last update: 5/14/04