Manual for Data Processing
PROCESSING THE DATA
This chapter is written for survey coordinators, data processing
experts and technical resource persons. It provides information on how to:
Prepare for processing the data
Set up a system for managing data processing
Carry out data entry
Edit the data and create a ‘clean’ data file for analysis
Produce tabulations with the indicators
Archive and distribute data
The MICS5 data-processing system is designed to deliver the first
results of a survey within a few weeks after the end of fieldwork.
This chapter contains information that will help you to undertake the
planning and advance preparation that will make this goal a reality.
The chapter begins by giving you an overview of the MICS5
data-processing system. It then discusses each of its components in
detail, providing references to supplemental sources of information
where appropriate. It closes with a set of three checklists that will
help you make the processing of your survey data a success.
The MICS5 data-processing system can achieve such rapid turnaround
because data are processed in tandem with
survey fieldwork. Data for each cluster are stored in a separate data
file and is processed as soon as the questionnaires are returned from
the field. This approach breaks data processing down into discrete
segments and allows it to progress while fieldwork is ongoing. Thus,
by the time the last questionnaires are finished and returned to
headquarters, most of the data have already been processed.
Processing the data by clusters is not difficult, but it does require
meticulous organization. The data-processing system can be divided
into three phases: preparation, primary data processing and secondary
data processing. Each of these phases is summarized in the sections
that follow and each has its own checklist at the end of the document.
The MICS5 Data Processing System
Preparation for Data Entry
The goal of preparing for the data-entry phase is to be ready to begin
shortly after the fieldwork commences. The preparation phase involves
the following steps:
Obtaining computer equipment and setting up a data-processing room
Identifying and recruiting appropriate personnel
Adapting computer programs to the country-specific questionnaire
Setting up a system for managing the questionnaires and data
Primary Data Processing
The goal of primary data processing is to produce clean, edited data
files. Primary data processing involves the following steps:
Entering all questionnaires for a cluster onto a data file
Producing field-check tables
Checking the structure of the data file
Entering the data a second time and then verifying the data file
Backing up the checked and verified data file
Performing secondary editing on the data file
Backing up the edited, or final, data file.
The flow of primary data processing is summarized in the flow chart on
the previous page. Note carefully that structure checking, the
verification of data entry and secondary editing are iterative
procedures that are repeated until all problems are resolved or
determined to be acceptable.
Secondary Data Processing
The goal of secondary data processing is to produce analysis data
files and to create the MICS5 standard tables. Secondary data
processing involves the following steps:
Concatenating all cluster data files into one data file
Imputing missing information
Exporting the data to the SPSS software
Calculating sample weights
Recoding variables to simplify analysis
Computing the wealth index
Creating the tables required to analyse the data
Archiving and distributing the data files.
Personnel and Infrastructure
The data-processing team for a MICS5 survey includes four types of
personnel: a questionnaire administrator, data-entry operators,
secondary editors and a data-processing supervisor. Each position has
distinct responsibilities and combining them is likely to damage the
quality of your data.
The questionnaire administrator (or office editor) checks and
organizes questionnaires as they arrive from the field. When a cluster
arrives at the data-processing office, he/she checks that all of the
questionnaires are present and ready to be entered. If there are
missing questionnaires, he/she must resolve the problem with the help
of the fieldwork team (the precise steps that the questionnaire
administrator must take are detailed later).
The data-entry operators enter the data. They should have prior
data-entry experience and be familiar with the questionnaires. One way
to accomplish this is to have the data-entry operators attend the
interviewers’ training. Before beginning data entry, a separate 2-3
day training session must be held to acquaint data-entry operators
with the data-entry program and the rhythm of the data-processing
system. By the end of the training, the data-entry operators should be
comfortable with the data-entry program and aware of their daily
responsibilities. The required number of data-entry operators depends
upon the number of available computers and is discussed in detail below.
The secondary editors investigate and resolve complex inconsistencies
discovered by the secondary editing program. They must have an
excellent understanding of the questionnaires and the goals of the
survey. Editing guidelines are provided in MICS Manual Chapter ‘Data
Editing Guidelines’ to aid them in the secondary editing process. A
typical survey will require one or two secondary editors.
The data-processing supervisor is a critical member of the
data-processing team. He/she adapts the model programs to suit her/his
country’s questionnaires and oversees all data-processing tasks. The
data-processing supervisor should have experience managing data
processing for a large-scale survey or census, an excellent
understanding of the questionnaire, and programming skills in the
CSPro and SPSS software packages. The data-processing supervisor
should be available on a full-time basis during the period that the
data are being entered, edited and tabulated.
The data-processing supervisor should be identified early in the
planning stages of the survey so that he/she can be involved in the
revision of the MICS5 questionnaire. This person should be consulted
to ensure that the coding schemes used in the questionnaire are
consistent and unambiguous and that all of the identification
information needed is included. The data-processing supervisor must
also be able to assist in final revisions to the questionnaire based
on experience gained while entering questionnaires from the pretest.
Computer Equipment and Other Hardware
Below is a list of equipment necessary for data processing:
The data-processing supervisor’s computer
Secondary storage devices (for example, portable USB devices
for operators to transfer files to the data-processing supervisor,
if a network is not established)
Toner cartridges/printer ribbons
Uninterruptible power supplies (UPS)
The data-entry computers should have, as a minimum, Pentium processors,
Windows 98 or higher, at least 32 megabytes of RAM and 1 gigabyte or more
of free hard-disk space, and should be networked together. The number of
data-entry computers needed to process the survey depends on the size
of the sample, the number of hours a data-entry operator will work
each week, the space available and the timetable for the survey. To
obtain an estimate of the number of computers needed for data entry,
you should use the “Fieldwork Duration, Staff, Data Processing and Supply
Estimates Template” Excel file.
The supervisor’s computer should have a faster processor, Windows 98
or higher, at least 64 megabytes of RAM, 1 gigabyte or more of free
hard disk space, a secondary storage device and should be networked to
the data-entry computers.
Uninterruptible power supplies and surge protectors are essential if
the country in which you are working suffers from power outages. Green
pens should be used whenever a member of the data-processing team
modifies the data on a questionnaire. The green ink distinguishes
these changes from the original data recorded by the interviewer (in
blue ink) and any changes made by the fieldwork team (in red ink).
The standard programs for processing MICS5 surveys were developed in
CSPro 5.0 and SPSS. CSPro, which has been used to process both surveys
and censuses, was developed collaboratively by the United States
Census Bureau, ORC Macro International and SerPro Ltda. It can be
downloaded free of charge from the website of the US Bureau of Census.
SPSS is a commercial software package that is made available to the
Implementing Agencies through UNICEF as part of the Technical
Assistance framework. SPSS can also be purchased through many software
Separate rooms are required for data entry and data editing. The
data-entry room should be large enough so that each data-entry
operator has space for her/his computer and the questionnaire on which
he/she is working. There should be desks or tables for working and
sufficient electrical outlets. The room should be cool, well lit and
as free from dust and humidity as possible. In countries with hot
climates, this requires that the room be air-conditioned. An
uninterruptible power supply should be connected to each computer. If
power outages are likely to be frequent or prolonged, another
emergency power supply, such as a generator, is necessary.
The data-editing room is for the questionnaire administrator and the
secondary editors. It, too, should be cool and well lit, and there
should be sufficient space for the editors to review questionnaires.
Ideally, the editing room will contain sufficient shelves or cupboards
to store the questionnaires in an organized fashion. If the
questionnaires cannot be stored in the editing room, then they should
be stored nearby and be easily accessible since they will be needed at
various stages throughout processing. Be careful not to underestimate
the amount of space that will be needed to store the thousands of
questionnaires that you will have in the office by the end of the
survey.
Adapting The Standard Programs
As outlined in the “Guidelines for the Customisation of MICS
Questionnaires”, the model MICS5 questionnaire must be adapted to the
situation in each country. This means that the model data-entry,
editing and tabulation programs must also be modified to be consistent
with the changes made in the questionnaire. The more changes that are
made to the model questionnaire, the more time must be allocated for
adapting and testing the programs. For example, if new questions are
added to the questionnaire, corresponding additions must be made in
the data-entry, editing and tabulation programs.
This process will be significantly easier if the question numbering in
the model questionnaire is maintained. If questions are added, a
letter should be added to the existing numbering (for example, a
question inserted between WS4 and WS5 should be numbered WS4A).
Similarly, if questions are deleted, the remaining questions should
not be renumbered. In addition, when coding categories are added to
those in the model questionnaire, they should be added to the end of
the existing list, leaving the other codes intact. The adaptation of
data-entry and editing programs should be completed prior to the
pretest. Questionnaires from the pretest should be entered and
edited using the programs. Following these instructions will serve two
purposes. It will reveal problems in the coding and skip patterns in
the questionnaires as well as any errors in the programs. Once the
pretest has been completed and the questionnaire finalized, final
changes can be made to the programs. Subsequent sections give basic
guidance on modifying the model data dictionaries and the model CSPro
applications. A more detailed summary of the contents of the CSPro
applications is provided in separate documents.
Even if you are not adding questions to the model questionnaires, the
model data dictionaries and applications contain certain items that
must be updated (for example, the acceptable range for the date of
interview, the acceptable range for the cluster number, etc.). These
items are necessarily country-specific and must be completed by you.
Thus, even if your country uses the model questionnaire, you will have
to adapt the standard programs.
The Data Dictionaries
In the Multiple Indicator Cluster Survey, groups of related questions
(for example, on education, contraceptive use and immunization) are
collected into modules that are then collected into questionnaires
(that is, for the household, individual women, individual men and for
children under five). In CSPro, dictionaries are used to describe this
data structure: a group of related variables (questions) comprises a
record (module), and a group of records comprises a level
(questionnaire). These are stored in a dictionary file (extension: dcf).
In addition to the data dictionary, forms linked to the dictionary are
used for data entry. There is usually one form for each record. The
forms are stored in a forms file (extension: fmf). The dcf and fmf
files can be modified directly. The best way to do this is to open the
forms file in CSPro. This will give you access to the data dictionary
and the forms together and ensure that the two remain synchronized. It
is advisable to keep a backup of the model data dictionary and forms
file for reference.
There are four types of MICS5 questionnaires. The Household
Questionnaire contains three units of analysis: the household, the
household members and the insecticide-treated nets. The
Questionnaire for Individual Women contains four units of analysis:
women, births, daughters and siblings. The questionnaire for
Individual Men and the Questionnaire for Children Under Five
correspond to a single unit of analysis: a man and a child,
respectively. All of the questionnaire types are stored in mics5.dcf.
Identification Variables and Levels
In CSPro, every questionnaire must have a series of variables that
uniquely identifies it. For example, a household is identified by its
cluster number and household number. The variables that identify a
questionnaire are known as the identification variables. Table 2 below
lists the questionnaire types and their identification variables.
Questionnaire Types and Their Identification Variables
As you can see from the table, women, men and children have the same
identification variables. Since each household member is listed on a
separate line in the household listing, no two women, men or children
will have the same line number, even if they are in the same
household. Thus, combined with cluster number and household number,
line number uniquely identifies a woman, child or man.
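The uniqueness rule can be illustrated outside CSPro. The following is a minimal Python sketch, not part of the MICS5 programs; the `IndividualId` type and `duplicate_ids` helper are hypothetical names introduced here. It treats the combination of cluster number, household number and line number as a key and reports any key that occurs more than once.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class IndividualId(NamedTuple):
    """Hypothetical container for the three identification variables."""
    cluster: int
    household: int
    line: int


def duplicate_ids(ids: Iterable[IndividualId]) -> List[IndividualId]:
    """Return every identifier that appears more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative data only: the last case repeats the first identifier.
cases = [
    IndividualId(1, 12, 2),
    IndividualId(1, 12, 3),
    IndividualId(1, 13, 2),
    IndividualId(1, 12, 2),
]
print(duplicate_ids(cases))
```

Note that the same line number (2) can appear in households 12 and 13 without conflict; only a repeat of all three values is a duplicate.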
In a CSPro dictionary, a level is defined by a set of identification
variables. In the MICS5 dictionary, there are two levels: households
and individuals (that is, eligible women, eligible men and eligible
children). Households are the first level while women, men and
children are the second level. This hierarchical structure is natural
since in the MICS5 questionnaire every woman, man or child belongs to
a household while a given household may have many women, men and
children.
The women’s questionnaire, men’s questionnaire and children’s
questionnaire are stored on the same level because each applies to a
household member. The data-entry application contains logic that skips
forms pertaining to men or children when entering a woman’s
questionnaire and skips forms pertaining to women or men when entering
a child’s questionnaire. Thus, although women’s, men’s and children’s
questionnaires are all stored as level-two cases, they have no common
variables except the identification variables.
The data dictionary was designed to reflect the modular structure of
the MICS5 questionnaires. Each module is stored in its own record
(exception: Female Genital Cutting module which has two records
because of its unusual structure) in MICS5.dcf and each record has a
form (or two, in case of the Female Genital Cutting module) associated
with it in entry.fmf. Thus, if your country does not use a particular
module, you can remove it by deleting its record and its form (and
removing any extra logic that references it from the data-entry
application).
The modules available for the Household Questionnaire (with the
module’s code(s) listed in parentheses) are: Household Information
Panel (0HH), Household Listing Form (0HL), Education (0ED), Selection
for child labour/child discipline (0SL), Child Labour (0CL), Child
Discipline (0CD), Household Characteristics (0HC), Insecticide-treated
Nets (0TN), Indoor Residual Spraying (0IR), Water and Sanitation
(0WS), Handwashing Facility (0HW), and Salt Iodization (0SI).
The modules available for the Questionnaire for Individual Women are:
Women’s Information Panel (0WM), Woman's Background (0WB), Access to
Mass Media and Use of Information/Communication Technology (0MT),
Fertility (0CM), Birth history (0BH),
Desire for Last Birth (0DB), Maternal and Newborn Health (0MN),
Postnatal health checks (0PN), Illness Symptoms (0IS), Contraception
(0CP), Unmet Need (0UN), Female Genital Mutilation/Cutting (0FG and
0FC), Domestic violence (0DV), Marriage/Union (0MA), Sexual Behaviour
(0SB), HIV/AIDS (0HA), Maternal mortality (0MM), Tobacco and Alcohol
Use (0TA) and Life satisfaction (0LS).
The modules available for the Questionnaire for Individual Men are:
Men’s Information Panel (MWM), Man's Background (MWB), Mass Media and
Information Technology (MMT), Fertility (MCM), Domestic violence (MDV),
Marriage/Union (MMA), Sexual Behaviour (MSB), HIV/AIDS (MHA),
Circumcision (MMC), Tobacco and Alcohol Use (MTA) and Life
The modules available for the Questionnaire for Children Under Five
are: Under-Five Child Information Panel (0UF), Child's Age (0AG),
Birth Registration (0BR), Early Childhood Development (0EC),
Breastfeeding and dietary intake (0BF), Immunization (0IM),
Vaccinations at health facility (0HF), Care of Illness (0CA) and
Variable Naming Conventions
Variables are named for the questionnaire module in which they are
located and the number of the question whose response they contain.
For example, question 4 in the Household Listing is stored in a
variable named HL4. Some questions are split into two or more parts,
with the separate parts identified by a unique letter. Each part of
such questions is stored in a separate variable. The names of these
separate variables include the letters that distinguish the parts of
the question. For example, question 4 of the Anthropometry module has
two parts. The first part of this question is stored in the variable
AN4, and the second part is stored in AN4A.
Some questions have two or more parts to the response categories.
These questions are stored in a single variable and the response
categories are defined as subitems. When these questions concern
dates, the letters ‘d’ (for day), ‘m’ (for month) and ‘y’ (for year)
are appended to the base variable’s name to create the name of the
subitems. In question 1 of the Woman's Background Characteristics,
for example, the woman’s month and year of birth are required. Her
response is stored in wb1, which has two subitems: wb1m and wb1y.
Some questions have a structure in which the first part of the
response is the form of the response and the second part is the
response. These questions are stored in a single variable and the form
and response are defined as subitems. The name of the subitem
storing the form of the response is the name of the variable with the
letter ‘u’ (for units) appended to it, while the name of the subitem
storing the response is the name of the variable with the letter ‘n’ (for
number) appended to it. For example, question 25 in the Maternal and
Newborn Health module records how long after birth a child was first
given breastmilk. The respondent may answer in hours or days. The
response is stored in the variable mn25 with subitems mn25u and mn25n.
Multiple Response Questions and Alphanumeric Variables
There are a number of questions that allow for multiple responses.
These questions are distinguished on the questionnaire by alphanumeric
response codes (that is, the letters A through Z). In the data
dictionary, the response to a multiple response question is stored in
an alphanumeric variable whose length equals the maximum number of
potential responses. These are the only alphanumeric variables in the
dictionary. Each alphanumeric variable has one subitem for each
response code on the questionnaire. The name of one of these subitems
is the variable’s name plus the response code that subitem
represents. For example, the second question in the Maternal and
Newborn Health module records all of the individuals from whom a woman
received antenatal care before her last birth. The potential response
codes are A, B, C, F, G and X. The variable mn2 is therefore six
characters long and there are six subitems: mn2a, mn2b, mn2c, mn2f,
mn2g and mn2x.
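The naming rule for these subitems is mechanical, as the short Python sketch below shows. It is an illustration only, not code from the MICS5 dictionary; the helper `subitem_positions` is a hypothetical name. It maps each subitem name to its 1-based character position within the alphanumeric variable, using the mn2 response codes listed above.

```python
from typing import Dict


def subitem_positions(variable: str, codes: str) -> Dict[str, int]:
    """Map each subitem name to its 1-based position in the variable.

    The subitem name is the variable name plus the lower-cased response
    code, and the variable is as long as the list of response codes.
    """
    return {variable + code.lower(): pos
            for pos, code in enumerate(codes, start=1)}


mn2 = subitem_positions("mn2", "ABCFGX")
print(mn2)
```

The dictionary has six entries, matching the six-character length of mn2, and mn2x occupies the last position.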
The model dictionaries use standard coding for certain responses. We
will first discuss coding conventions for numeric variables. The
response ‘Other’ is always coded as a 6 with leading 9s. Inconsistent
responses are always coded as a 7 with leading 9s. The response
‘Doesn’t know’ is always coded as an 8 with leading 9s. Questions with
a missing response (that is, the interviewer did not record a response
to an applicable question) are always coded as a 9 with leading 9s.
Questions that are not applicable to a respondent are always coded as
a blank. Table 3 below summarizes the standard coding conventions.
Summary of Standard Coding Conventions
Because the codes 6 through 9 are reserved for special use, any
question that requires more than six response categories should have
2-digit response categories with leading zeros (for example, 01, 02,
03, 04, 05, 06, 07, 96, 97, 98 and 99).
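The "leading 9s" convention for numeric variables can be expressed compactly. The Python sketch below is an illustration, not part of the model dictionaries; the names `SPECIAL_LAST_DIGIT` and `special_code` are hypothetical. Given the kind of special response and the width of the variable in digits, it returns the code described above.

```python
# Last digit of each special code, as described in the text.
SPECIAL_LAST_DIGIT = {
    "other": 6,
    "inconsistent": 7,
    "dont_know": 8,
    "missing": 9,
}


def special_code(kind: str, width: int) -> int:
    """Return the special code for a numeric variable of the given width."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Pad with 9s on the left so the code fills the whole variable.
    return int("9" * (width - 1) + str(SPECIAL_LAST_DIGIT[kind]))


print(special_code("other", 2))      # two-digit variable
print(special_code("dont_know", 3))  # three-digit variable
```

Not-applicable responses are stored as blanks rather than numbers, so they are deliberately left out of this sketch.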
For alphanumeric variables, the response ‘Other’ is always coded as X,
the response ‘Doesn’t know’ is always coded as Z, a missing value is
always coded using the question mark character (?), and not applicable
is coded as a blank.
Most of the questions in the MICS5 questionnaires have defined
response ranges. The ranges are defined for variables in the
dictionary MICS5.dcf. CSPro checks during data entry that any value
entered in a variable is within that variable’s defined ranges. CSPro
allows for a large number of ranges for each variable, so questions
with non-consecutive response ranges (for example, 1-8, 96, 98 and 99)
should be defined using several ranges (for example, 1-6, 96, 98 and
99, instead of 1-99). While dictionary ranges are useful for checking
simple ranges, more complicated or conditional ranges (for example,
consistency between day and month in a date variable) should be
checked in the data-entry or editing applications.
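The benefit of defining several narrow ranges rather than one wide range can be seen in a small Python sketch. This is an illustration of the idea only; CSPro performs the check itself from the dictionary, and the names `in_ranges`, `NARROW` and `WIDE` are hypothetical.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]  # inclusive (low, high) pair


def in_ranges(value: int, ranges: Iterable[Range]) -> bool:
    """Return True if value falls inside any of the inclusive ranges."""
    return any(low <= value <= high for low, high in ranges)


# Several narrow ranges: 1-6, 96, and 98-99.
NARROW = [(1, 6), (96, 96), (98, 99)]
# One wide range covering everything from 1 to 99.
WIDE = [(1, 99)]

for value in (3, 7, 50, 97, 98):
    print(value, in_ranges(value, NARROW), in_ranges(value, WIDE))
```

With the narrow definition, mistyped values such as 7, 50 or 97 are rejected at entry; with the single wide range they would all be accepted.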
The Data-entry Application
The data-entry application is a long and complex program. Space
limitations prevent it from being described in any detail in this
chapter. Instead, this section will concentrate on some important
general issues about the data-entry application.
The MICS5 questionnaires make abundant use of skips. Skips are
instructions on the questionnaire that tell the interviewer to skip
all the questions between the current question and a question later on
in the questionnaire. Skips on a questionnaire must be matched by
skips in the corresponding data-entry program. Skips in a data-entry
program define the data-entry path. CSPro strictly enforces the
data-entry path whenever the ‘skip to’ or ‘skip to next’ commands are
used.
If a data-entry operator enters a value for a variable that is
inconsistent with previously entered information, it is useful to
display an error message. This error message should explain the nature
of the problem and provide any information that might help resolve the
inconsistency. In CSPro, the errmsg function displays an error message
with user-defined text whenever it is called. The error messages for
the data-entry program are numbered and stored in the file entry.mgf.
The text, number and inconsistencies that lead to each of these
messages being displayed are listed in “Data Editing Guidelines”, as
are guidelines for resolving them.
You should review your questionnaire to determine if any of the
questions that have been added require checking for consistency. If
they do, you should add logic to check their consistency in the
data-entry program, the editing program, or both. When you add a
consistency check, be sure to add a corresponding error message to the
data-entry or editing message file. Also, if you add error messages,
make sure you do not use an existing error message number.
Some error messages are followed by a reenter command that returns to
the field that is being entered. This forces the entry operator to
address the error before advancing. Because the data-entry operator
will at times be required to enter corrections, careful supervision is
necessary. When you add your own error messages, consider carefully
whether you want to force the data-entry operator to resolve the
problem before advancing. If this is the case, follow your error
message with a reenter command.
The data-entry application checks that alphanumeric variables are
correctly entered. It performs four checks on each alphanumeric
variable. First, it checks that the entered value contains only codes
that are listed on the questionnaire (that is, it performs a range
check). Second, it checks that the responses are entered in
alphabetical order (that is, ACG and not GAC). Third, it checks that
if the ‘Doesn’t know’ or ‘No one’ codes (generally the letter ‘Y’) are
included in the response, then no other response is present (that is,
it will not allow the response ACY). Fourth, it checks that if the
missing code (‘?’) is included in the response, then no other response
is present (that is, it will not allow the response AC?).
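The four checks can be sketched in Python as follows. This is an illustration of the logic described above, not the alphachk function from the data-entry application. The function name `check_multiple_response`, the returned labels, and the default set of exclusive codes (`"YZ"`) are assumptions made for this sketch.

```python
from typing import List

MISSING = "?"  # missing-value character for alphanumeric variables


def check_multiple_response(value: str, valid_codes: str,
                            exclusive_codes: str = "YZ") -> List[str]:
    """Return the names of the checks that the entered value fails."""
    failed = []
    letters = [ch for ch in value if ch != MISSING]
    # 1. Range check: only codes listed on the questionnaire are allowed.
    if any(ch not in valid_codes for ch in letters):
        failed.append("range")
    # 2. Responses must be entered in alphabetical order.
    if letters != sorted(letters):
        failed.append("order")
    # 3. An exclusive code (for example 'No one') must stand alone.
    if len(value) > 1 and any(ch in exclusive_codes for ch in letters):
        failed.append("exclusive")
    # 4. The missing code must stand alone.
    if len(value) > 1 and MISSING in value:
        failed.append("missing")
    return failed


CODES = "ABCFGXY"  # hypothetical code list for illustration
for entry in ("ACG", "GAC", "ACY", "AC?"):
    print(entry, check_multiple_response(entry, CODES))
```

An empty list means the value passes all four checks. In the data-entry application a failure would normally trigger an error message rather than return a label.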
The data-entry application also rearranges the values entered in
alphanumeric variables so that each response is stored in the location
that defines its subitem. For the variable ‘mn2’, for example, the
response ACG will be rearranged to A C G , where there is one blank
each between A and C and C and G and three blanks after G.
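The rearrangement amounts to writing each entered code into the character position reserved for it and leaving every other position blank. The Python sketch below illustrates this; it is not the CSPro logic itself, and `place_responses` is a hypothetical name. The code list used is the six-code list given earlier for mn2, which is an assumption for this example; with a longer code list the result would carry more trailing blanks.

```python
def place_responses(value: str, codes: str) -> str:
    """Put each entered code in its own position; blank out the rest."""
    return "".join(code if code in value else " " for code in codes)


stored = place_responses("ACG", "ABCFGX")
print(repr(stored))
```

The stored string is always as long as the code list, so each subitem can be read from a fixed position regardless of which responses were given.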
A useful feature of CSPro is that it allows programs to define their own
functions. Such functions are known as user-defined functions and can
be useful. In particular, they allow one to avoid rewriting frequently
used code. User-defined functions are always defined at the top of a
CSPro application. The data-entry application entry.app contains 24
user-defined functions. You do not need to modify these functions (except
the vdvalid function), but you must understand what they do if you are to
understand the data-entry application.
The valid function checks whether a variable’s value is one of the
special values: inconsistent, doesn’t know, missing or not applicable.
If the value of a variable is not applicable, the natozero function
changes it to ‘0’, allowing it to be added to another variable (for an
example of its use, see procedure cm10). The notEq function checks for
inequality between two values after first treating any not-applicable
value as ‘0’. The badspecial function ensures that the special
answers for questions that include both unit and number are consistent
with one another (for example, see procedure db3n).
There are three user-defined functions that concern the birth history
(validyr, afterint and ndjlba). The validyr function ensures that a year
variable has a valid year value. It is like the valid function except that it
takes into account four-digit years. The afterint function checks whether the
recorded date is after the date of interview. The ndjlba function is a
slight modification of the built-in adjlba function, which is described in
Table 4 below.
The next seven user-defined functions (zscoef, dabs, zspct, zseval,
zscr, zsanth and agemth) are used in the Questionnaire for Children
Under Five to calculate the anthropometry scores that are found at the
end of the Anthropometry module. The agemth function is called to
calculate the child’s age in months. The zsanth function is then
called. This function calls zseval, zscr and zspct. The function
zseval calls zscoef, and zspct calls dabs. You will only encounter
these functions in the anthropometry variables, and if you encounter
them you will know that they are calculating and then checking
The code in the agemth function calculates the age of the child in
months. Because anthropometry is highly sensitive to age, the age of
the child must be based on the child’s age in days. The code first
calculates the number of days that have elapsed between the beginning
of the year and a child’s birth. It then calculates the number of days
that elapsed between the beginning of the year and the date of
interview. Finally the number of days in the years between the year of
birth and the year of interview is added to the number of days since
the beginning of the year until the date of interview. The difference
between these two numbers of days is the child’s age in days. This is
then converted into the child’s age in months by dividing by 30.4375
(the average number of days in a month over four years). Because of
the need for accuracy, the child’s age in months is calculated to two
decimal places.
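The calculation described above can be followed step by step in a Python sketch. This is an illustration of the described arithmetic, not the agemth function from the data-entry application; the names `age_in_days` and `age_in_months` are hypothetical, and rounding to two decimal places is an assumption about how the final precision is applied.

```python
import calendar
from datetime import date

DAYS_PER_MONTH = 30.4375  # 365.25 days per year divided by 12 months


def age_in_days(birth: date, interview: date) -> int:
    """Age in days, built up from day-of-year offsets as in the text."""
    # Days from the start of the birth year to the date of birth.
    birth_offset = birth.timetuple().tm_yday
    # Days from the start of the interview year to the interview date.
    interview_offset = interview.timetuple().tm_yday
    # Add the length of every year from the birth year up to, but not
    # including, the interview year.
    for year in range(birth.year, interview.year):
        interview_offset += 366 if calendar.isleap(year) else 365
    return interview_offset - birth_offset


def age_in_months(birth: date, interview: date) -> float:
    """Age in months, rounded to two decimal places."""
    return round(age_in_days(birth, interview) / DAYS_PER_MONTH, 2)


print(age_in_months(date(2012, 1, 1), date(2013, 1, 1)))
```

The day count produced this way agrees with a direct subtraction of the two dates, which gives a convenient cross-check when adapting the logic.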
The vdvalid, vdoi and vdob functions check that vaccination dates
entered in the Immunization module are consistent, are not after the
date of interview and are not before the date of birth, respectively.
The vacgiven function checks if the vaccination is given or not. The
endmess (short for ‘end message’) function displays a message at the
end of a questionnaire that asks the data-entry operator whether
he/she wants to review the current questionnaire or continue to the
next one. Then, the alphachk function performs the checks on
alphanumeric variables detailed in the previous subsection.
Three user-defined functions (clearlabels, setnet and checknet) are
used in the Insecticide Treated Nets module. The setnet function prepares the
array of possible household members sleeping under a particular net for
display. The checknet function checks the validity of responses for persons sleeping
under a net, while the clearlabels function clears the array of labels that was
created and used by setnet.
Dates and Century Month Code
The model programs (including the data-entry application) use century
month codes (CMC) for most dates. The CMC for a date is the number of
months since December 1899. For example, the CMC for January 1900 is
1; the CMC for March 2000 is 1203. The CMC for a date is calculated as
follows: subtract 1900 from the date’s year, multiply that number of
years by 12, and then add the number of the date’s month to the
product. For example, the CMC for March 2000 is calculated as
(2000 - 1900) x 12 + 3 = 1203.
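The CMC formula and its inverse can be written as two short Python functions. This sketch is for illustration only and is not taken from the model programs; the names `cmc` and `cmc_to_year_month` are hypothetical.

```python
from typing import Tuple


def cmc(year: int, month: int) -> int:
    """Century month code: months elapsed since December 1899."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - 1900) * 12 + month


def cmc_to_year_month(code: int) -> Tuple[int, int]:
    """Convert a century month code back to a (year, month) pair."""
    years, month_index = divmod(code - 1, 12)
    return 1900 + years, month_index + 1


print(cmc(1900, 1))             # January 1900
print(cmc(2000, 3))             # March 2000
print(cmc_to_year_month(1203))  # back to (year, month)
```

Because a CMC is a single integer, the interval between two dates in months is simply the difference between their codes.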
The data-entry application uses four functions to simplify working
with century month codes. Two of these functions, setlb and setub,
calculate the lower and upper bounds, respectively, for the CMC of the
date of an event. The other two functions, adjlba and adjuba, adjust
the lower and upper bounds, respectively, of the CMC of the date of an
event (that is, the birthday of a child) when an age is also
specified. Table 4 below summarizes these functions.
CSPro Functions for Simplifying Work with Century Month Codes
lcmc setlb (month, year, minimum);
The function’s arguments are a month, year and a minimum CMC. If both
year and month are valid, the CMC is calculated and returned. If year
is not valid, minimum is returned. If month is not valid, the CMC for
January of year is returned.
ucmc setub (month, year, maximum);
The function’s arguments are a month, year and a maximum CMC. If both
year and month are valid, the CMC is calculated and returned. If year
is not valid, maximum is returned. If month is not valid, the CMC for
December of year is returned.
t adjlba (lcmc, ucmc, di, di, age);
do if (cage >= 12 and cage <= 15).
+ compute bf1215 = 0.
+ if (BF2 = 1) bf1215 = 100.
end if.
variable labels bf1215 "Children 12-15 months".
The second restriction imposed by the insert file command is that if a
command continues over multiple lines, column 1 of the continuation
lines must be blank. The example below illustrates a multiline
command that respects this restriction.
Notice that the subcommands on the second and third lines are indented
two columns. (While they need only be indented one column to satisfy
the restriction, they have been indented two columns to remain
consistent with the MICS5 programming style.)
The third and fourth restrictions imposed by the include command are
that command terminators are optional and that an asterisk (*) in the
first column of a line indicates a comment line. Neither of these
restrictions affects our tabulation programs.
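The column-1 rules can be illustrated with a small Python sketch that classifies each line of a syntax file the way these rules would read it. It is not part of the MICS5 programs: the function name classify_line and the sample syntax lines are hypothetical, and the handling of a plus sign (+) in column 1 as the start of an indented command is based on the SPSS batch-mode syntax rules and is otherwise an assumption.

```python
def classify_line(line: str) -> str:
    """Classify one syntax line by the character in column 1."""
    if line.strip() == "":
        return "blank"
    first = line[0]
    if first == "*":
        return "comment"        # asterisk in column 1: comment line
    if first in " \t":
        return "continuation"   # blank column 1: continues the command
    # Any other character in column 1 starts a new command; a leading
    # "+" marks a command whose text is indented after the sign.
    return "command"


# Hypothetical sample lines, for illustration only.
example = [
    "* Children age 12-15 months.",
    "do if (cage >= 12 and cage <= 15).",
    "+ compute bf1215 = 0.",
    "end if.",
    "frequencies",
    "  variables = bf1215",
    "  /statistics = mean.",
]
for text in example:
    print(f"{classify_line(text):<13}{text}")
```

A continuation line that accidentally starts in column 1 would be classified as a new command, which is why the continuation lines must be indented.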
In addition to All Tables in Sheets.sps, there is an SPSS program,
CSPro.sps, that automates the creation of analysis files. It should be
used only after all of the component programs have been executed and
shown to work. It is useful for recreating analysis files when a change
is made to one of the file creation programs, because it ensures that
all of the analysis file creation programs are executed in the proper
order.
Archiving and Distributing Data
An important – but often neglected – component of data processing is
the archiving and documentation of data files. In addition, whether
the data files will be available widely or only within a single
institution, it is imperative to establish some guidelines for
distribution well in advance. These steps – archiving, documenting and
distributing – require an investment of time and effort. The
investment is well worth it, however, for a number of reasons:
Increasing the cost-effectiveness of data collection. Collecting
survey data is a costly and labour-intensive activity. In order to
justify this investment, the data collected should be exploited as
fully as possible. Making data files available to other
researchers increases the cost-effectiveness of the survey.
Increasing country ownership of the data and acceptance of the
results. When the data file is available for others to use, the
data collection process gains credibility. The collectors of the
data are viewed as having confidence in their findings, and the
accessibility of the data file to other researchers means that the
results can be replicated and verified by others.
Ability to examine trends. Often, published results from different
surveys are not directly comparable. For example, one survey
report may define adult respondents as those age 15 or older while
another defines adults as those age 18 or older. Without data
files, the best that can be done is an imprecise comparison of the
two sets of results. When the data files for the two surveys are
available, however, the results can often be retabulated so that
they are directly comparable, allowing conclusions about trends to
be drawn.
Ability to compare results within and across countries. It is
often instructive to compare results across countries, either
within a subregion or across regions. These comparisons facilitate
the identification of areas where a particular programme emphasis
is needed or where programmes have been particularly successful.
Furthermore, it may be useful to compare results from different
surveys within the same country. Sometimes this is done to
validate unexpected results (when infant mortality is lower than
expected, for example) or to assess the effects of a particular
data collection methodology (for example, relying on vaccination
cards versus mothers’ reports of vaccinations). In order to
conduct these types of analyses, researchers require access to
data files so that directly comparable figures can be calculated.
Allows in-depth analysis of important subject areas by
specialists. Because of the pressure to report findings quickly,
the information presented in a survey report usually includes only
the basic findings of a survey. A well-documented and available
data file will allow in-depth analyses of particular subject areas
to be conducted, and these analyses can be done by subject
specialists who may not be on the staff of the data collection
agency.
The MICS5 analysis file should be archived, documented, anonymized and
distributed. Copies of all of the programs and files used during the
survey processing should also be archived and made available upon
request. A copy of the analysis files and their documentation should
be sent to the UNICEF Regional Office and to UNICEF New York
(Statistics and Monitoring Section). Finally, a policy and procedure
for the distribution of the data file to others should be established.
Obtain computers and other data-processing equipment.
Set up a data-processing room or space.
Recruit a data-processing supervisor and other personnel.
Set up a system for organizing processing activities.
Adapt programs for consistency with pretest questionnaire.
Enter and edit pretest questionnaires.
Finalize programs based on pretest experience and the final
questionnaires.
Receive questionnaires from the field.
Assign main data entry.
Check the structure of the main data-entry file.
Assign verification data entry.
Verify that the main and verification data files are identical.
Back up the raw data file.
Produce field check tables.
Perform secondary editing.
Back up the final data file.
Impute missing and inconsistent information.
Export the data to SPSS.
Calculate and add sample weights, a wealth index and GPS data.
Run the tabulation programs.
Archive the data and develop a data distribution policy and
procedure.
Send the analysis files, their documentation and all programs to
UNICEF.
Sample Cluster Tracking Form
[Form columns include: number of questionnaires, date verification
complete, raw data backup, date of editing]
1 The web address is: http://www.census.gov/ipc/www/cspro/