The Bristol Observatory

Incorporated in 1997 by John A. Pandiani, Ph.D., Sociologist, and Steven M. Banks, Ph.D., Mathematician

Three Approaches to Cross Data Set Analysis 1

John A. Pandiani and Steven M. Banks

State mental health agencies are increasingly called upon to describe the relationships between their caseloads and the caseloads of other human service agencies.  Such descriptions generally involve cross data set analysis.  Information on caseload overlap derived from cross data set analysis may be useful for measuring levels of access to mental health services.  When involvement with other state agencies follows mental health treatment, cross data set analysis can provide powerful measures of treatment outcomes.  Outcome measures that can be derived in this way include employment, trouble with the law, hospitalization, and economic dependency, among many others.

Cross data set analysis has two important advantages over alternative approaches.  First, because the methodology relies on existing databases, it does not require the commitment of substantial amounts of staff time or financial resources to collect new data.  Second, cross data set analysis can support evaluation of changes in systems of care that occurred in the past, and provide baseline data for evaluating current or anticipated changes in systems of care wherever basic client information resides in electronic databases.

There are three general approaches to determining the number of people shared across data sets.  Direct record linkage, the most widely used approach, relies on unique person identifiers that are shared across data sets to determine the number of people who appear in both data sets (caseload overlap).  The second approach uses constructed identifiers, which approximate unique person identifiers by pooling semi-unique attributes of people.  Finally, Probabilistic Population Estimation measures caseload overlap without personal identifiers, using calculations based on probability theory.

Direct Record Linkage

Direct record linkage relies on pre-existing unique identifiers such as social security number that are contained in multiple data sets.  Caseload overlap is determined by linking records that contain the same identifier from two (or more) data sets and counting the number of identifiers that appear in both (or all) data sets. 
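The counting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production linkage routine: the function name `caseload_overlap` and the sample identifiers are hypothetical, and the sketch assumes that each identifier belongs to exactly one person.

```python
def caseload_overlap(*datasets):
    """Return (count, identifiers) for identifiers present in every data set.

    Each data set is an iterable of identifiers. Repeated records for the
    same identifier within one data set are counted once.
    """
    id_sets = [set(ids) for ids in datasets]
    shared = set.intersection(*id_sets)
    return len(shared), shared


# Hypothetical identifiers, for illustration only.
mental_health = ["111-11-1111", "222-22-2222", "333-33-3333", "222-22-2222"]
corrections = ["222-22-2222", "444-44-4444", "333-33-3333"]

count, shared = caseload_overlap(mental_health, corrections)
print(count)
```

Because the result includes the shared identifiers themselves, this approach reveals which individuals appear in both data sets, not only how many.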

Direct record linkage is considered by many to be the gold standard for measuring the number of people occurring in multiple data sets.  In addition to providing measures of the magnitude of caseload overlap, direct record linkage also identifies the individuals who appear in multiple data sets.  Direct record linkage, however, assumes that no identifier is used by more than one person and that no person has more than one identifier.  In the real world there are conditions under which these assumptions are violated.  Direct record linkage also raises issues of personal privacy and the confidentiality of medical records.  The United States General Accounting Office report on record linkage and privacy 2 provides an excellent overview of methods for reducing threats to privacy while using direct record linkage, such as data masking, list inflation, and third party vendors.  Another major disadvantage of unique identifier systems is the cost and time required to build them, and the continuing cost and time required to ensure that every individual is assigned one and only one unique identifier.

Constructed Identifiers

Constructed identifiers use combinations of personal attributes such as name fragment, date of birth, race/ethnicity, and gender to build pseudo-unique person identifiers.  For example, the identifier used by the State of Vermont for the federal substance abuse Treatment Episode Data Set (TEDS) is a constructed identifier that combines the first three letters of the client's first name, the first three letters of the client's mother's maiden name, and the client's date of birth.  Caseload overlap is determined by linking records from two (or more) person level data sets and counting the number of identical constructed identifiers that appear in both (or all) data sets.  This method has the advantage of being based on widely available data items that exist in multiple data sets.
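A constructed identifier of the kind described for TEDS might be assembled as in the following sketch. The three components come from the description above, but the function name `constructed_id`, the upper-casing, and the way the parts are concatenated are assumptions made for illustration. The actual TEDS layout is not specified here.

```python
from datetime import date


def constructed_id(first_name, mothers_maiden_name, birth_date):
    """Build a pseudo-unique identifier from two name fragments and a date.

    The concatenation order and date format are illustrative assumptions.
    """
    first = first_name.strip().upper()[:3]
    maiden = mothers_maiden_name.strip().upper()[:3]
    return f"{first}{maiden}{birth_date:%Y%m%d}"


# Hypothetical client record.
example_id = constructed_id("Robert", "Smith", date(1970, 3, 15))
print(example_id)  # ROBSMI19700315
```

Once every record carries such an identifier, caseload overlap is counted exactly as in direct record linkage, by intersecting the sets of identifiers from each data set.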

There are two risks associated with using constructed identifiers for record linkage.  Constructed identifiers can link records for different people (false positive) or can fail to link records for the same individual (false negative).  The risk of each type of error changes with the number of data elements used in the constructed identifier.  The frequency of false positives can be understood in terms of the probability of coincidences.3  This mathematical observation shows that when two data sets include a large number of individuals, each of whom has a small probability of falsely matching, there is a high probability of false matches for the group as a whole.  Constructed identifiers based on date of birth and gender only are likely to produce false positive matches in data sets with more than 40,000 people.
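The effect of coincidence can be illustrated with a rough calculation. The sketch below is not the authors' analysis: it assumes that identifiers are spread uniformly over a fixed number of possible values, that the two data sets describe entirely different people, and that the number of chance matches is approximately Poisson. The function names and the figure of 80 birth years are hypothetical.

```python
import math


def possible_identifiers(birth_years=80, days_per_year=365, genders=2):
    """Number of distinct (date of birth, gender) values, assumed equally likely."""
    return birth_years * days_per_year * genders


def expected_false_matches(n_a, n_b, k):
    """Expected number of chance matches between n_a and n_b unrelated records."""
    return n_a * n_b / k


def prob_any_false_match(n_a, n_b, k):
    """Poisson approximation to the chance of at least one false match."""
    return 1.0 - math.exp(-expected_false_matches(n_a, n_b, k))


k = possible_identifiers()
for size in (100, 1_000, 10_000):
    print(size, expected_false_matches(size, size, k),
          prob_any_false_match(size, size, k))
```

Under these assumptions the expected number of false matches grows with the product of the two data set sizes, which is why a per-pair probability that looks negligible still produces many false links in large data sets.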

The frequency of false negatives can be understood using the volatility of name fragments as an example.  Names frequently have a number of forms (Robert/Bob, for instance), and may change over time (marriage, divorce, etc.).  Constructed identifiers based on larger numbers of data elements (e.g., the previously mentioned TEDS identifier, which includes name fragments) fail, at an uncertain rate, to link records from two data sets that describe the same individual.  The utility of a constructed identifier therefore depends on the trade-off between the benefits and detriments of including different numbers and different types of data elements.
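The Robert/Bob case can be made concrete with a small sketch. The record layout and the helper names `name_fragment` and `make_id` are hypothetical, and the identifier here uses only a first-name fragment and a date of birth to keep the example short.

```python
def name_fragment(name, length=3):
    """First few letters of a name, upper-cased."""
    return name.strip().upper()[:length]


def make_id(record):
    """Illustrative constructed identifier: name fragment plus date of birth."""
    return name_fragment(record["first"]) + record["dob"]


# Two hypothetical records that describe the same person.
record_a = {"first": "Robert", "dob": "1970-03-15"}
record_b = {"first": "Bob", "dob": "1970-03-15"}

same_person_links = make_id(record_a) == make_id(record_b)
print(make_id(record_a), make_id(record_b), same_person_links)
```

The two identifiers differ, so the records are not linked even though they describe one individual. Dropping the name fragment would remove this false negative but would increase the risk of false positives.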

Probabilistic Population Estimation

Probabilistic Population Estimation is a statistical procedure for measuring the number of people represented in data sets that do not share unique person identifiers.4  Probabilistic Population Estimation provides valid and reliable estimates of the number of people represented in both databases, but does not reveal who the people are.  For this reason, Probabilistic Population Estimation is not suitable for clinical and case management applications that require the identification of individuals.

Probabilistic Population Estimation has important advantages over alternative approaches to cross data set analysis.  First, the personal privacy of individuals and the confidentiality of medical records are assured because Probabilistic Population Estimation does not depend upon information that identifies specific individuals.  Second, Probabilistic Population Estimation can measure caseload overlap between databases that do not include the same unique person identifiers.  Third, Probabilistic Population Estimation provides precise confidence intervals for determining the statistical significance of findings.
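The general idea of estimating counts from dates of birth can be sketched as follows. This is an illustrative occupancy-model calculation, not necessarily the formula published by the authors: it assumes a single cohort (for example, one birth year and gender) with 365 equally likely birthdays, estimates the number of people from the number of distinct birthdays observed, and obtains overlap by inclusion-exclusion. The function names are hypothetical, and confidence intervals are omitted.

```python
import math

DAYS = 365  # assumed number of equally likely birthdays in one cohort


def estimate_people(distinct_birthdays, days=DAYS):
    """Estimate how many people produce the observed count of distinct birthdays.

    Inverts E[distinct] = days * (1 - (1 - 1/days) ** n) for n.
    """
    d = distinct_birthdays
    if not 0 <= d < days:
        raise ValueError("distinct_birthdays must be in [0, days)")
    return math.log(1 - d / days) / math.log(1 - 1 / days)


def estimate_overlap(birthdays_a, birthdays_b, days=DAYS):
    """Estimated number of people in both data sets, by inclusion-exclusion."""
    a, b = set(birthdays_a), set(birthdays_b)
    return (estimate_people(len(a), days)
            + estimate_people(len(b), days)
            - estimate_people(len(a | b), days))


# Hypothetical day-of-year birthdays observed in two data sets for one cohort.
agency_a = [12, 45, 45, 101, 200, 310]
agency_b = [45, 101, 150, 310, 311]
print(estimate_people(len(set(agency_a))), estimate_overlap(agency_a, agency_b))
```

In practice a calculation of this kind would be repeated for each cohort and the results summed. Only birthday counts enter the calculation, so the estimate says how many people are shared but not who they are.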

___________________________________

1   Adapted from a chapter on "Cross Agency Data Integration For Evaluating Systems of Care" by Steven Banks, John Pandiani, Monica Simon and Nancy Nagel in the forthcoming volume Outcomes for Children and Youth with Behavioral and Emotional Disorders and Their Families edited by Michael Epstein, Krista Kutash, and Al Duchnowski.

2    United States General Accounting Office (GAO) (2001, April). Record Linkage and Privacy: Issues in Creating New Federal Research and Statistical Information. GAO-01-126SP.

3    Diaconis, P. & Mosteller, F. (1989). Methods for Studying Coincidences. Journal of the American Statistical Association, 84, 853-861.

4    Banks, S.M. & Pandiani, J.A. (2001). Probabilistic Population Estimation of the Size and Overlap of Data Sets Based on Date of Birth. Statistics in Medicine, 20, 1421-1430.

The Bristol Observatory
521 Hewitt Road
Bristol, VT 05443

bristob@together.net

(802) 453-7070 / (802)453-5061 Fax

Copyright © 2000 The Bristol Observatory
Web design by Fern Hill (last update 7/23/08)