Network analysis: Getting comfortable with uncertainty
by Indiana Business Research Center (IBRC)
Our previous blog post covered our work with EmployIndy as the workforce hub in Marion County and how we have built out their network of partners and funders. We are now working to build out these known education and workforce networks across the state. To do this, we need to consolidate and integrate all known education and workforce organizations from federal, state, non-profit and commercial sources.
These sources include some of the most obvious elements about an organization: name and location, perhaps a web address, sometimes the officers. Some sources provide attributes such as when the organization was established, its purpose or type, an “industry” code, a CFDA number or even a federal ID for non-profits. Some sources we’ve used also include information on budget, number of employees, and whether there is a so-called “parent-child” relationship (such as a headquarters and its branches). Additionally, in some cases we can integrate data that shows the grantor and grantee relationship.
A major thrust of our work on the Indiana Data Partnership (IDP) has been to test how we can apply network analysis techniques to these organization, attribute and collaboration data and, ultimately, to make this information publicly available through the Indiana Management Performance Hub.
As part of this work, we’ve had to confront quite a bit of uncertainty all along the way. A few of the sources of uncertainty that exist in our data include:
- Data entry errors
- Contextual bias or conflicts. That is, data originally collected for specific, mission-driven purposes, which are now being used in an application that may have never been imagined.
- Definitional differences. That is, an organization that exists as a single record in one source may exist as multiple records in another source (to reflect sub-divisions within the organization, or sites at which programs within the organization operate).
- Time-based uncertainty. Does the data describe a long-term condition, or a one-time event? Does a given condition have a prescribed expiration, or will it exist indefinitely?
- Direct or indirect connections
Every data source added to our education/workforce IDP data model adds uncertainty, and we continue to address each new source of it. We cannot expect to patch every “hole” in the model through manually intensive effort; that isn’t sustainable. Additionally, every instance of manual intervention presents another opportunity for human bias to be introduced.
In terms of using the data for network analysis, the biggest uncertainty is the last one on the list above: connections between and among organizations. Statistical methods such as bootstrapping, jackknifing, and empirical likelihood offer some solutions. Our current approach to addressing uncertainty and nuance in our data model is to assign strength scores to connections between entities and collect those connections in an index. Originally, we used a simple yes/no flag to indicate whether an organization in our integrated database was involved in workforce and education. It became clear that this would not work, so we now generate a measure on a strength continuum, derived from multiple shared characteristics that suggest involvement.
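To make the idea of a strength score concrete, here is a minimal sketch in Python. The characteristic names and weights are hypothetical and chosen only for illustration; they are not the actual IDP scoring rules. The sketch sums a weight for each characteristic two organizations share and stores the result in an index keyed by the pair.

```python
# Hypothetical weights: how strongly each shared characteristic is assumed
# to suggest a connection. Illustrative values only, not the IDP's real rules.
WEIGHTS = {
    "shared_address": 0.4,
    "shared_officer": 0.3,
    "grantor_grantee": 0.2,
    "same_industry_code": 0.1,
}


def connection_strength(shared):
    """Sum the weights of the shared characteristics (0.0 to 1.0 here)."""
    return round(sum(WEIGHTS[name] for name in shared if name in WEIGHTS), 3)


def build_index(evidence):
    """Map each unordered pair of organizations to its strength score."""
    return {
        frozenset(pair): connection_strength(shared)
        for pair, shared in evidence.items()
    }


# Made-up example data: which characteristics each pair shares.
evidence = {
    ("Org A", "Org B"): {"shared_address", "shared_officer"},
    ("Org A", "Org C"): {"same_industry_code"},
}

index = build_index(evidence)
for pair, score in index.items():
    print(sorted(pair), score)
```

Unlike a yes/no flag, the score lets weak evidence (a shared industry code) and strong evidence (a shared address and officer) sit on the same continuum, so a cutoff can be chosen later rather than baked into the data.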
What can realistically be inferred from a connection between two entities in our data model which has a high score?
- Ideally, this indicates that real world collaboration is actually happening between these two entities.
- Perhaps these entities are so strongly related to each other that in the real world they are typically considered the same entity. If such cases can be detected, the data model may be able to resolve one of our major sources of uncertainty without requiring rigid accommodations in the data structures or manual intervention.
- False positive. In this case, why did the underlying data lead us to falsely imply a connection between these entities? Which assumptions failed?
What happens when our model fails to indicate collaboration between two entities that are actually real-world collaborators?
- Is this evidence of a data gap? What data sources exist that can address gaps like these?
- If no data exists to reflect this relationship, should this data be created? Does the lack of data to describe this relationship do a disservice to the stakeholders from these entities? Could they receive more funding or support if this relationship were better reported or documented?
- Is it realistically possible to measure this relationship?
- Are the real-world actors over-estimating the partnership?
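Both sets of questions amount to comparing what the model implies against collaborations known from other evidence. A minimal sketch of that comparison follows, assuming a hypothetical cutoff of 0.5 for a "high" score and a small hand-collected set of known collaborations; the names, scores and threshold are all made up for illustration.

```python
def compare_to_known(scores, known_pairs, threshold=0.5):
    """Split pairs into confirmed links, possible false positives, and possible gaps.

    scores: dict mapping frozenset({org1, org2}) to a strength score
    known_pairs: set of frozensets for collaborations known from other evidence
    threshold: hypothetical cutoff above which a score counts as "high"
    """
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    return {
        "confirmed": predicted & known_pairs,
        "possible_false_positives": predicted - known_pairs,
        "possible_gaps": known_pairs - predicted,
    }


# Made-up example data.
scores = {
    frozenset(("Org A", "Org B")): 0.7,
    frozenset(("Org A", "Org C")): 0.1,
    frozenset(("Org B", "Org D")): 0.6,
}
known_pairs = {
    frozenset(("Org A", "Org B")),
    frozenset(("Org A", "Org C")),
}

result = compare_to_known(scores, known_pairs)
for label, pairs in result.items():
    print(label, [sorted(pair) for pair in pairs])
```

The labels are deliberately hedged as "possible": a pair flagged as a false positive may be a real collaboration that nobody has documented, and a flagged gap may reflect a partnership that is overstated by the people involved.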
As we work to build and verify a data model that can be used to perform network analysis, we will describe the many ways we’ve approached uncertainty.