The role of data science in contact tracing
Contact tracing using smartphone apps has been one of the most effective containment strategies for COVID-19 to date, as evidenced by the early successes of South Korea and China. The technique relies on mobile Bluetooth and/or GPS technology to identify when individuals have come in contact with an infected person. Automated alerts can be pushed to the user’s phone and to public health authorities, allowing the individual to self-quarantine before spreading the disease further. Countries that have deployed contact tracing apps with high adoption rates have fared much better overall, with fewer cases and deaths.
The U.S. is also pursuing contact tracing. In April, Andrew Cuomo, Governor of New York, called for an army of contact tracers. Tech companies such as Google and Apple have announced interoperable APIs and developer tools for contact tracing app development. PK also piloted its own real-time contact tracing app, PKontact, as early as April.
New approaches and technologies almost always bring new and different challenges. From a data science perspective, contact tracing presents issues with data privacy and analysis. However, these challenges can be overcome by applying data science thinking to the design of the whole process.
Where does data science take place in contact tracing? Does it take place where we write scripts in R, Python, Scala or other languages to read and manipulate data? Does it take place in machine learning and deep learning algorithms that detect potential risks and predict future trends? Does it take place in the design of the underlying structure of a data storage solution?
Data science covers many different aspects of data-related work. However, all of them share a common starting point: data acquisition.
Data acquisition is an important step in most data science projects, and yet is often overlooked. It affects other data efforts, including the scalability of the underlying structure, the data storage efficiency, the data cleaning logic, the dimensionality and granularity in data analyses, the business logic and questions translated into algorithms, to name a few.
“Good data science requires considerations from the overall data ecosystem.” – Heather Harris, Intelligence and Analytics Practice Director at PK
Because of how the work duties are split, data scientists focusing on different aspects of data science do not always have the opportunity to start their collaboration as early as the data acquisition stage, which can get in the way of greater insights. Take contact tracing apps as an example. Quite a few companies and organizations are developing these apps, but there are still no global standards, and no national standards in the U.S., for the overall data process. The public health authorities and experts working in the data science field need to start rethinking the ecosystem and consider what data points the proposed data acquisition approach in contact tracing will bring in and what limitations the disparate data sources will pose to their work.
From decentralized data collection to centralized analysis
As this MIT article points out, current contact tracing still relies heavily on human tracers. There’s a need for labor-intensive work, such as phone follow-ups and in-person interviews. Contact tracing apps may not be able to replace all the manual, investigative work related to contact tracing, but they certainly help the health authorities identify cases and reach people sooner.
Contact tracing with mobile technology that can identify when and where an individual was infected, as well as how many people they have been in contact with, makes advanced spatial analyses and predictive analytics more accurate and efficient. To realize those gains, the manual and automated data sources need to be carefully integrated to reduce duplications, omissions and other scenarios that might lead to inconsistent results.
It is a familiar concept in the data science world that a unique identifier needs to be created, either during data acquisition or later during data integration. In contact tracing, it is equally important to apply such identifiers to manually collected datasets.
Associating unique identifiers with individuals and activities allows analysts to easily merge datasets later on while still storing data at the most granular level. It then enables systematic data cleaning and interpolation. See the figure for a mapping diagram for how tables can be joined with unique identifiers as primary and foreign keys.
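To make the idea of joining on unique identifiers concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample events are hypothetical and chosen only for illustration; they do not describe PKontact or any real contact tracing schema. The sketch assumes that a contact between the same two people on the same date should be stored once, whether it was reported by an app or by a human tracer.

```python
import sqlite3
import uuid


def new_id() -> str:
    """Return a random unique identifier (hypothetical ID scheme)."""
    return uuid.uuid4().hex


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign keys in SQLite

# Hypothetical schema: person_id is the primary key that every
# contact event refers back to as a foreign key.
conn.executescript("""
CREATE TABLE person (
    person_id TEXT PRIMARY KEY
);
CREATE TABLE contact_event (
    event_id   TEXT PRIMARY KEY,
    person_id  TEXT NOT NULL REFERENCES person(person_id),
    contact_id TEXT NOT NULL REFERENCES person(person_id),
    event_date TEXT NOT NULL,
    source     TEXT NOT NULL,  -- 'app' or 'manual'
    UNIQUE (person_id, contact_id, event_date)
);
""")

person_a, person_b, person_c = new_id(), new_id(), new_id()
conn.executemany(
    "INSERT INTO person (person_id) VALUES (?)",
    [(person_a,), (person_b,), (person_c,)],
)

# Made-up events: the first two describe the same contact,
# once from the app and once from a manual interview.
events = [
    (new_id(), person_a, person_b, "2020-04-20", "app"),
    (new_id(), person_a, person_b, "2020-04-20", "manual"),
    (new_id(), person_a, person_c, "2020-04-21", "manual"),
]
# OR IGNORE skips rows that would violate the UNIQUE constraint,
# so the duplicate report is not stored a second time.
conn.executemany(
    "INSERT OR IGNORE INTO contact_event VALUES (?, ?, ?, ?, ?)", events
)

# Join on the identifier to count distinct contacts per person.
rows = conn.execute("""
    SELECT p.person_id, COUNT(e.event_id)
    FROM person AS p
    LEFT JOIN contact_event AS e ON e.person_id = p.person_id
    GROUP BY p.person_id
""").fetchall()
contacts_per_person = dict(rows)
```

Because the identifiers carry no personal details themselves, the event table can stay at the most granular level while names and phone numbers live elsewhere, or nowhere at all.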
It is almost certain that such analytic attempts will raise data privacy and data security concerns. Even with decentralized data collection methods and the most stringent security approaches, the idea of sharing any personal health information, even if it’s anonymized, can feel like a violation of personal privacy to many, and having one’s data analyzed by authorities can feel dangerous to others. However, dire situations call for extreme measures. If COVID-19 continues to run rampant, people may become more comfortable with elevating public health over data privacy.
As some countries emerge better off than others during COVID-19, the success of places like Kerala in India that have deployed aggressive contact tracing may provide a proven path forward. It reinforces how crucial it is for people to work together and take extra precautions, including the pursuit of contact tracing on a larger scale than we have ever seen or imagined.
Finding a balance between data usability and privacy
The most important lines of defense in public safety and data security are the same: people. Stay-at-home orders are useless if no one follows them. Technologies can have negative impacts if they’re abused. That is why we need mature data strategies that help ensure user adoption of contact tracing apps while also preventing misuse by bad actors.
The key to successful contact tracing is to follow rigorous approaches and still be able to find the balance between data usability and data privacy. One of the most common ways to accomplish this balance is called data masking.
Data masking, also known as data obfuscation, is a technique that hides personally identifiable information and sensitive data to reduce the risk of a data breach. Encryption is often used to transform and hide data for similar purposes, but the two differ: encryption can be reversed with a key, whereas data masking is irreversible. Masked data is also easier for computers to consume.
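As a rough illustration of the masking idea, the sketch below replaces each character of a sensitive value with a random character of the same kind. The function name `mask_value` and the sample record are hypothetical, and this is only one simple masking strategy (random character substitution), not the method used by any particular product. Because the replacements are random and no key or lookup table is kept, the original value cannot be recovered from the masked one.

```python
import secrets
import string


def mask_value(value: str) -> str:
    """Replace digits and letters with random ones of the same kind.

    Length, letter case, and punctuation are preserved, so the masked
    value still looks like the original type of data.
    """
    masked = []
    for ch in value:
        if ch.isdigit():
            masked.append(secrets.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            masked.append(secrets.choice(pool))
        else:
            masked.append(ch)  # keep separators such as '-' or ' '
    return "".join(masked)


# Made-up sample record, for illustration only.
record = {"name": "Jane Doe", "phone": "206-555-0100"}
masked_record = {field: mask_value(value) for field, value in record.items()}
```

The masked record keeps the same field lengths and separators as the original, so downstream validation and analysis code can usually run against it unchanged.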
Today, data masking techniques are being used for a variety of sensitive information, including:
- PII (Personally Identifiable Information)
- PHI (Protected Health Information)
- Payment card data covered by PCI-DSS (Payment Card Industry Data Security Standard)
- Intellectual Property
CDC’s preliminary evaluation of digital contact tracing tools also suggests that data anonymization is required when handling PII and PHI data. While data masking ensures the anonymity of data, it still preserves the complexity and unique characteristics of the data, such as the length of string values. It produces an anonymized, “fake” and yet authentic dataset for the analytic environment. The following is an example of Microsoft Dynamic Data Masking in SQL Server:
The challenges caused by disparate tracing methods will take time to resolve. Approaching data problems like these requires expert knowledge of the available technologies and of the data itself, not to mention good teamwork and an adept cross-functional approach. In other words, we need a comprehensive data framework that guides us through each step, from data acquisition to data examination, data cleaning, data analysis and finally data modeling and forecasting. Without a universal standard, we must work together to find common ground, as the disease doesn’t follow country or state boundaries.
Check out our webinar on PK’s contact tracing app to see it in action.
About the Author
Chan Shan is a data scientist in PK’s Intelligence and Analytics practice. She has extensive consulting experience in multiple industries and has worked on data science projects for Microsoft, Petco and T-Mobile. She was born and raised in China and moved to the U.S. a few years ago, which gives her a unique perspective on the data science field in the post-globalization era.