DataFerrett: Browser for TheDataWeb
TheDataWeb is a collaboration between multiple agencies
TheDataWeb HelpDesk: Toll Free: 866-437-0171

Metadata Interface File Documentation

The following describes the layout of the DataFerrett Metadata Interface File (MIF). Metadata is the information that defines a dataset and the variables, or items, found within that dataset. This includes the name of the Data Collection, the name of each dataset within that collection, the time period for the dataset, and the name, description, and values of each item in the dataset, as well as other information.

The MIF is an ASCII file that is used to populate the DataFerrett metadata database. That database contains all of the information passed to users through the DataFerrett data access tool.

Each piece of metadata is denoted by a short delimiter (or token) of one to three characters. Each delimiter must start in column one and be followed immediately by at least one space. Delimiters surrounded by colons allow multiple-line entries without repeating the delimiter at the beginning of each line; they are placed at the beginning and at the end of the text.

A MIF is split into two segments: Dataset Level Metadata and Item Level Metadata. We will look at each segment separately.

The Dataset Level Metadata segment defines the dataset and must come first. Before we define the tokens and information needed, we will discuss exactly what we mean by a “dataset.”

In DataFerrett, there can be up to three levels of a “dataset” defined. The highest level is what we refer to as a Data Collection. A collection is comprised of one or more lower level datasets. For example, a survey such as the Current Population Survey (CPS) is comprised of a set of basic questions asked every month, and specialized “supplements” that are asked every so often. Therefore, the data collection is the Current Population Survey, and the basic monthly questions and each supplement are each considered a dataset within that collection. There are only two levels in this “dataset” example.

Some data collections may consist of three levels, with an intermediate level between the “collection” and the “dataset.” You can consider it a “sub-collection” or “sub-grouping” within a data collection. For example, the Survey of Income and Program Participation (SIPP) is a data collection. SIPP is a longitudinal survey that is broken into “panels,” and the data is released for each panel. But in each panel, there are “core” questions that are asked during each time period of the panel, and also “topical module” questions that are asked only at certain points in time. Therefore, in this case, the panels are an intermediate level between the collection, SIPP, and the datasets, Core and Topical Modules.

The last part of defining a dataset is the time period of the dataset. This is typically a year, or month and year, for the period for which the data was collected. The data collector, or provider, determines the time period for a dataset. Each MIF can contain information for only one time period. However, DataFerrett has the ability to use data from different time periods for the same dataset, as long as the data items are defined the same way from one time period to the next. This will be discussed in more detail later.

Now we will briefly describe each Dataset Level Metadata MIF delimiter, or token, and then look at each of them in more detail. Apart from VER, each Dataset Level Metadata token contains two or three letters beginning with an S. The following are valid Dataset Level tokens. The tag [Optional] at the end of the token description indicates that the token is not required. All other tokens are required. Also please note that a MIF can contain comments, which are marked by a # at the beginning of the line.

VER  Version number (version number of metadata publishing system; must be the very first line in the MIF; currently 1.0)
SO   Dataset operation (NEW or UPDATE)
SC   Dataset Name (limit 255 characters)
SL   Data Collection long name (limit 255 characters)
SS   Data Collection short name (limit 12 characters)
SB   Intermediate Level Name [Optional] (limit 255 characters)
ST   Dataset time frame, startdate:stopdate (must have start date and stop date, e.g. Jan 2000:Jan 2000 or 2001:2001)
SD   Dataset data category (1=microdata, 2=aggregate data)
SZ   Display type in DataFerrett (1=Normal (time at lowest level), 2=Inverted (dataset at lowest level))
SA   Tabulation machine address and port (either domain name or IP address; if no port is given, 4505 will be assigned)
SX   Extraction machine address and port (either domain name or IP address; if no port is given, 4505 will be assigned)
SN   Longitudinal data (YES=longitudinal) [Optional]
SI   Inherited Dataset Name [Optional] (must use an existing dataset’s name (SC token) within the same Data Collection)
SU   Logo image URL for the dataset [Optional] (fully qualified URL, e.g. http://www.name.com/image/datasetlogo.gif)
SSM  Sponsor Name [Optional]
SSU  Sponsor URL [Optional]
SSB  Sponsor Banner URL [Optional]
SPM  Provider Name [Optional]
SPU  Provider URL [Optional]
SPB  Provider Banner URL [Optional]
SDU  Document URL [Optional]

Dataset Level Tokens – A Detailed Look

Throughout this section we will describe each token, then add that token to an example MIF. After all tokens have been described, we will end up with a fully defined MIF at the dataset level. For our example we will define a health dataset, namely one part of the National Hospital Ambulatory Medical Care Survey (NHAMCS).

VER
Syntax – VER x.y
This defines which version of the metadata publishing system you are using to publish the dataset to DataFerrett. This MUST be the first line of every MIF. The version is important because the version you are using on your local machine must match the version in use on the centralized DataWeb site when you try to publish there. The version currently in use is 1.0.
Example MIF:
    VER 1.0

SO
Syntax – SO OPERATION
Valid operations: NEW, UPDATE
This defines the operation that you are performing on the dataset you are defining. You use the NEW operation the very first time, and only the first time, you publish metadata for a specific dataset. Once a dataset has been published, you use the UPDATE operation, whether you are making dataset level metadata changes or item level metadata changes. There is also an item level operation token (GO), which will be discussed in that section. Both tokens are necessary in every MIF. In our example, we will assume that we are publishing this dataset’s metadata for the first time. Later we will discuss the UPDATE operation in detail.
Example MIF:
    VER 1.0
    SO NEW

SC
Syntax – SC Dataset Name
Size Limit – 255 characters
This defines the name of the dataset at the lowest level of the data collection. For our example, the National Hospital Ambulatory Medical Care Survey is comprised of two distinct sections, each with its own set of data. Therefore, we would need a MIF for each section, where the dataset name is different, but both share the same data collection name. In this example we will define one of the sections, the Outpatient Department dataset.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
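
The line layout described earlier (token in column one, at least one space, then the value, with # marking a comment line) can be sketched as a small reader. This is only an illustration under stated assumptions: the function name parse_mif_lines is hypothetical, it is not part of DataFerrett, and it ignores the colon-delimited multi-line entries used at the item level.

```python
def parse_mif_lines(text):
    """Split MIF text into (token, value) pairs.

    Hypothetical helper for illustration only. It handles single-line
    tokens and # comments, not colon-delimited multi-line entries.
    """
    pairs = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and comment lines (# in column one).
        if not line.strip() or line.startswith("#"):
            continue
        # The token must start in column one.
        if line[0].isspace():
            raise ValueError(f"line {number}: token must start in column one")
        # Everything up to the first space is the token; the rest is the value.
        token, _, value = line.partition(" ")
        pairs.append((token, value.strip()))
    return pairs


example = """VER 1.0
# first publication of this dataset
SO NEW
SC Outpatient Department
"""
for token, value in parse_mif_lines(example):
    print(token, "->", value)
```

Keeping the pairs in file order matters here, because the position of a token (VER first, for instance) is part of the format.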

SL
Syntax – SL Data Collection long name
Size Limit – 255 characters
This defines the full name of the data collection.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey

SS
Syntax – SS Data Collection short name
Size Limit – 12 characters
This is an acronym or abbreviation for the Data Collection. It is used by the DataFerrett system to identify the dataset when tabulating or extracting the data.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS

SB [Optional]
Syntax – SB Intermediate Level Name
Size Limit – 255 characters
This is an optional level of data for those datasets that have a level between the data collection and the dataset. Many datasets do not contain this middle level of definition. Our example does not need it, so we DO NOT include the token in our MIF.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS

ST
Syntax – ST startdate:stopdate
This defines the dataset time frame, which must have a start date and a stop date separated by a colon, e.g. Jan 2000:Jan 2000 or 2001:2001. When a month is part of the time period, it must be defined using the 3 letter abbreviation with the first letter in upper case. Also, there must be a space between the month and the year.

The very first time you publish information for a dataset, the start and stop dates MUST BE THE SAME. Then, for every new time period that you add, the start date will always stay the same and the stop date will be changed to the next time period. This will be discussed in more detail when the UPDATE operation is discussed.

For our example, the first dataset that we are publishing is the NHAMCS data collected for the year 1996.
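
The ST rules above can be checked mechanically. The following is a minimal sketch, not DataFerrett code: the names parse_date and parse_timeframe are hypothetical, and it assumes four-digit years and the usual English month abbreviations (Jan, Feb, and so on), since the text only gives examples such as Jan 2000 and 2001.

```python
import re

# Assumption: the usual English three-letter month abbreviations.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Optional "Mon " prefix (first letter upper case, one space), then a year.
# Assumption: years are written with four digits.
_DATE = re.compile(r"(?:([A-Z][a-z]{2}) )?(\d{4})")


def parse_date(text):
    """Return (year, month_number_or_None) for 'Jan 2000' or '2001'."""
    match = _DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"bad date: {text!r}")
    month, year = match.groups()
    if month is not None and month not in MONTHS:
        raise ValueError(f"unknown month abbreviation: {month!r}")
    return int(year), (MONTHS.index(month) + 1 if month else None)


def parse_timeframe(value, operation):
    """Validate an ST value such as '1996:1996' for a NEW or UPDATE MIF."""
    start_text, colon, stop_text = value.partition(":")
    if not colon:
        raise ValueError("ST needs both a start date and a stop date")
    start, stop = parse_date(start_text), parse_date(stop_text)
    # On first publication the start and stop dates must be the same.
    if operation == "NEW" and start != stop:
        raise ValueError("for SO NEW, start and stop dates must match")
    return start, stop


print(parse_timeframe("1996:1996", "NEW"))
print(parse_timeframe("Jan 2000:Jan 2000", "NEW"))
```

A value such as jan 2000:jan 2000 is rejected by this sketch because the month does not begin with an upper case letter.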
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996

SD
Syntax – SD data category number
Valid entries – 1 or 2 (1=microdata, 2=aggregate data)
There are basically two types of data that can be put into and accessed through DataFerrett: microdata and aggregate data. Microdata is data in which every record is at the unit of analysis level and all records must be added up to get the totals for each data item. For example, for surveys of individuals, microdata contain records for each individual interviewed; for surveys of organizations, the microdata contain records for each organization.

Aggregate data is data which has already been summarized or added up, usually for specific geographical units or some other unit, such as industry classifications. In this case, each record is a geographical unit and there is no summing needed to get the totals for the geographies.

In our example, the NHAMCS is a survey that collects the data from every outpatient hospital visit at the hospitals in the survey sample. Therefore, this dataset is a microdata dataset, where you must add up the responses to all the questions in order to get the totals. So, we will define this token as SD 1 (since 1=microdata).
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1

SZ
Syntax – SZ display type number
Valid entries – 1 or 2 (1=Normal, 2=Inverted)
The display type refers to the way the dataset and time period information is displayed in the available datasets tree in DataFerrett. Display type = 1 refers to a normal tree view, where the time period of the dataset is at the lowest level. Display type = 2 refers to an inverted tree view, where the dataset name is at the lowest level, directly below the time period. See examples.

There are reasons for doing it one way or the other. The normal view is most appropriate for a dataset that is collected regularly over time, either monthly, annually, biannually, etc., although the collections do not have to be at such regular intervals.

The inverted view is appropriate for data collections that contain several datasets for a specific time period, but where the datasets differ from one time period to the next.

In our example, the NHAMCS is collected on a yearly basis, so display type 1, the normal view, is most appropriate: SZ 1.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1
    SZ 1

SA
Syntax – SA Tabulation machine address:port number
The tabulation machine is the machine that contains the data that is used in DataFerrett tabulations. Enter either the domain name or IP address of the machine that contains the data, a colon, and then the port number. If no port number is given, port 4505 will be assigned. You will most likely need to check with the system administrator who set up TheDataWeb servers for this information.

In our example, the machine that contains the data has a domain name of sippda.census.gov and is accessed through port 4505. Since it is port 4505, we could leave off the port (and the colon).
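
The default-port rule can be illustrated with a short sketch. The function name parse_machine_address is hypothetical, and the sketch assumes the address is a domain name or a dotted IPv4 address, so a single colon is enough to separate host from port.

```python
DEFAULT_PORT = 4505  # assigned when the token gives no port


def parse_machine_address(value):
    """Split an SA or SX value into (host, port).

    Hypothetical helper; assumes a domain name or dotted IPv4 address.
    """
    host, colon, port_text = value.partition(":")
    host = host.strip()
    if not host:
        raise ValueError("machine address is missing")
    if not colon:
        # No colon at all: fall back to the default port.
        return host, DEFAULT_PORT
    # int() raises ValueError if the port is empty or not a number.
    return host, int(port_text.strip())


print(parse_machine_address("sippda.census.gov:4505"))
print(parse_machine_address("sippda.census.gov"))
```

Both calls describe the same machine and port, which is why the port and colon can be omitted in the example MIF.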
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1
    SZ 1
    SA sippda.census.gov:4505

SX
Syntax – SX Extraction machine address:port
The extraction machine is the machine that contains the data that is used to download extracts out of DataFerrett. Enter either the domain name or IP address of the machine that contains the data, a colon, and then the port number. If no port number is given, port 4505 will be assigned. You will most likely need to check with the system administrator who set up TheDataWeb servers for this information. Usually the tabulation and extraction machines will be the same, but sometimes they are different machines, or the same machine accessed through different ports.

In our example, both machines are the same.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1
    SZ 1
    SA sippda.census.gov:4505
    SX sippda.census.gov:4505

SI [Optional]
Syntax – SI Inherited Dataset Name
You must use an existing dataset’s exact name (SC token) within the same Data Collection. This token is not commonly used. It is for data collections that have datasets that can be used along with one or more of the other datasets in that collection. For example, the basic monthly CPS data can be used with any of the CPS supplements. So, when defining the supplement you would define the SI token with the basic monthly dataset’s name, which is simply Basic. This is defined in the basic monthly MIF as the SC token. The SI token means that the supplement dataset "inherits" the basic monthly dataset’s items, so users can tabulate or extract items from both datasets for a given time period.

In our example, the NHAMCS outpatient department dataset DOES NOT inherit any other dataset. Therefore, we do NOT include the SI token.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1
    SZ 1
    SA sippda.census.gov:4505
    SX sippda.census.gov:4505
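
With every required dataset level token now in the example, the rules stated so far can be gathered into one check. This is a sketch under assumptions, not the behavior of the actual publishing system: check_dataset_header is a hypothetical name, and it only tests what this page states (VER first, required tokens present, size limits, and the valid entries for SO, SD, and SZ).

```python
# Tokens not tagged [Optional] in the dataset level token list.
REQUIRED = ("VER", "SO", "SC", "SL", "SS", "ST", "SD", "SZ", "SA", "SX")
# Size limits stated for the name tokens.
LIMITS = {"SC": 255, "SL": 255, "SS": 12, "SB": 255}
# Valid entries stated for the coded tokens.
CHOICES = {"SO": ("NEW", "UPDATE"), "SD": ("1", "2"), "SZ": ("1", "2")}


def check_dataset_header(pairs):
    """Return a list of problems found in (token, value) pairs.

    Hypothetical helper; an empty list means no problem was found.
    """
    problems = []
    if not pairs or pairs[0][0] != "VER":
        problems.append("VER must be the first line")
    seen = dict(pairs)
    for token in REQUIRED:
        if token not in seen:
            problems.append(f"missing required token {token}")
    for token, limit in LIMITS.items():
        if len(seen.get(token, "")) > limit:
            problems.append(f"{token} exceeds {limit} characters")
    for token, valid in CHOICES.items():
        if token in seen and seen[token] not in valid:
            problems.append(f"{token} must be one of {', '.join(valid)}")
    return problems


header = [
    ("VER", "1.0"),
    ("SO", "NEW"),
    ("SC", "Outpatient Department"),
    ("SL", "National Hospital Ambulatory Medical Care Survey"),
    ("SS", "NHAMCS"),
    ("ST", "1996:1996"),
    ("SD", "1"),
    ("SZ", "1"),
    ("SA", "sippda.census.gov:4505"),
    ("SX", "sippda.census.gov:4505"),
]
print(check_dataset_header(header))
```

Returning a list of problems rather than stopping at the first one lets a publisher fix several mistakes in a single pass.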

SU [Optional]
Syntax – SU fully qualified URL
This token defines a URL that contains a dataset’s logo image. The URL must be fully defined, including the http:// at the beginning. The logo image is used in the browser window that returns the links to the extracted files when using the extraction function in DataFerrett. This logo can provide DataFerrett users with dataset and dataset provider information when they extract data. In our example, a logo image URL is defined.
Example MIF:
    VER 1.0
    SO NEW
    SC Outpatient Department
    SL National Hospital Ambulatory Medical Care Survey
    SS NHAMCS
    ST 1996:1996
    SD 1
    SZ 1
    SA sippda.census.gov:4505
    SX sippda.census.gov:4505
    SU http://ferret.bls.census.gov/images/hamodban.jpg

We have now fully defined the Dataset Level Metadata for the NHAMCS Outpatient Department 1996 dataset. Next we will look at defining the Item Level Metadata for this same dataset, before going on to how to define updates.

The remaining segments define the items contained in a dataset and can come in any order throughout the rest of the MIF. The possible segments are: Global, New, Update, Timeframe, or Stop.

The Global segment contains item level metadata tokens and should appear at the top of the MIF after the Dataset Level segment, although this is not required. A global value can be changed within the file by entering a new global value at the point at which the new value should begin. Also, global values are overridden by an individual value for any specific item. The following tokens are valid Global tokens, and the definitions can be found in the item-level metadata section below (GC is the global version of the C token, GW is the global version of the W token, etc.):

    GC GT GW GX GY GZ GI GE GG

The remaining four segments describe the operation to be performed on the item-level metadata contained within the segment. Each segment is preceded by the GO global token, which indicates the Global Operation the segment should perform. The following operations are valid values for the GO token:

Item-level metadata appears between operation tokens (GO) and must begin with the M token to indicate that a new item-level variable is beginning. The definition of a variable is complete when the ingestion system finds another M token, any global token, or the end-of-file, whichever comes first. The tag [Optional] at the end of the token description indicates optional tokens.

M  Item (variable) name or mnemonic.
S  Short description or English label (limit 60 characters, cannot contain quotation marks)
C  Concept or topic label
T  Time of item: when it began, e.g. Jan 1994 for January 1994, and when it ended (if it has), e.g. Jan 1994:Jun 1994. If it continues into the future, there is no stopdate.
W  Suggested weight variable name [1] (e.g. BASEWGT), OR Yes (if item IS a weight), OR NONE (if there is no suggested weight)
X  Security Level, use entire word as follows:
       Public
       Sponsor
Y  Variable type abbreviation as follows:
       E = Edited
       U = Unedited
       W = Weighting
       R = Recode
       X = Allocation flag
       T = Topcoded
       S = Sample Control
       G = Geography
       P = Replicate Weights
       N = Public Use
Z  Data type abbreviation as follows:
       B = Binary (numeric)
       Cx = Character (user defines x to be the length of the field, from 1 to 255)
       T = Military time (HH:MM)
       Ix.y = Implied decimal (user defines x to be the total length of the value including the decimal, and y to be the number of digits to the right of the decimal)
           For example:
           I10.4 = Implied decimal (5 digits to the left and 4 digits to the right of the decimal)
           I5.2 = Implied decimal (2 digits to the left and 2 to the right)
           Note: the value line should then contain the minimum and maximum value with a decimal, e.g. for Z I5.2, the value line should be V 0.00:99.99
       Fx.y = Floating point with precision as described for Implied decimals
N  Unit type abbreviation as follows:
       ABS = Absolute number
       AVG = Average
       DOL = Dollars
       MIN = Minutes
       PCT = Percent
       SQM = Square miles
       TH$ = Thousands of dollars
       RTE = Rate
G  Geography indicator
       0 = Not a geography item
       1 = Geography item, but selection not required
       2 = Geography item, selection required

The following item is optional, but strongly recommended for items that are not allocation flags or topcoded items:

V  Value with description (limit 100 characters). Each value line should have a V at the beginning, but DO NOT put a V at the beginning of a line if the description wraps to the next line.
       V 1 Male
       V 2 Female
   or, for a continuous range variable, the minimum and maximum values MUST be separated by a colon:
       V -1 Blank
       V 0:99 Years
   or, for a continuous range with decimals (e.g. Z I10.4):
       V 0.0000:99999.9999
   IF the data contains a blank, the value should be defined exactly as:
       V Blank
:L:  Long description. There may be a multiple line description. There must be a :L: on the lines before and after the description.

The following items are optional:

P  CD-ROM or ASCII file data start and end positions, e.g. P 15 16
U  Universe description (Universe descriptions MUST follow the Long description for an item.)
:A:  Attachment type (e.g. Edit Specs, Recode Specs, Instrument Specs, Sampling, User Note, etc.) followed by the URL of the text, beginning on the next line, e.g. http://www.census.gov/mydir/myfile.htm (Please note: there is no :A: line after the URL line.)
B  Synonyms (multiple words should either be listed separately or be comma delimited), e.g.
       B men
       B boy
       B gender
       B women
       B girl
   or
       B men, boy, gender, women, girl
I  Iteration group size for longitudinal data (i.e. if the variable repeats 12 times, then 12 would be the group size).

_____________________________
[1] When entering new variables, please place variables used as Suggested Weight variables with their corresponding information at the top of the file.
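
The relationship between an implied decimal data type (Z Ix.y) and its value line can be worked through in a few lines. This is a sketch only: implied_decimal_value_line is a hypothetical name, and it assumes the range runs from zero to the largest value that fits the field, as in the two examples above. A real item may have a narrower range.

```python
import re

_IMPLIED = re.compile(r"I(\d+)\.(\d+)")


def implied_decimal_value_line(z_value):
    """Build the widest V line that fits a Z value such as 'I5.2'.

    Hypothetical helper. Assumes a minimum of zero, as in the examples.
    """
    match = _IMPLIED.fullmatch(z_value)
    if match is None:
        raise ValueError(f"not an implied decimal type: {z_value!r}")
    total, right = int(match.group(1)), int(match.group(2))
    # The total length includes the decimal point itself.
    left = total - right - 1
    if left < 1 or right < 1:
        raise ValueError(f"field too short: {z_value!r}")
    minimum = "0." + "0" * right
    maximum = "9" * left + "." + "9" * right
    return f"V {minimum}:{maximum}"


print(implied_decimal_value_line("I5.2"))   # V 0.00:99.99
print(implied_decimal_value_line("I10.4"))  # V 0.0000:99999.9999
```

For I10.4 the ten characters are five digits, the decimal point, and four digits, which is why only five digits remain to the left of the decimal.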

Last update: 5/14/04