Why is synthetic data interesting for state and local agencies?
Synthetic data is generated via machine learning, using generative AI to create a dataset based on real-world data. It is meant to be statistically equivalent to the original data, reproducing the same patterns, correlations, and statistical properties without copying any real record.
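To make "same patterns, correlations, and statistical properties" concrete, here is a minimal sketch. It is not the method used by any agency or vendor named in this article. It assumes a toy two-column dataset (hours studied and test score, both made up) and swaps in a simple fitted Gaussian model where a real system would use a generative AI model. All function names are hypothetical.

```python
import math
import random
import statistics


def make_toy_records(n, seed=1):
    """Made-up (hours_studied, test_score) pairs standing in for real data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        score = 50 + 4 * hours + rng.gauss(0, 5)
        records.append((hours, score))
    return records


def fit(pairs):
    """Estimate means, standard deviations and correlation of the two columns."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)
    return mx, my, sx, sy, cov / (sx * sy)


def sample(params, n, seed=0):
    """Draw brand-new records from a bivariate Gaussian with the fitted statistics."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mx + sx * z1, my + sy * (rho * z1 + residual * z2)))
    return out


real = make_toy_records(500)
synthetic = sample(fit(real), 5000)

print("real correlation:     ", round(fit(real)[4], 3))
print("synthetic correlation:", round(fit(synthetic)[4], 3))
```

The synthetic rows are drawn from the fitted model rather than copied, so the two printed correlations should be close even though no synthetic record corresponds to a real one. A Gaussian only captures means, spreads and linear correlation; richer generative models aim to capture more structure than that.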
For state and local governments, generating synthetic data solves a number of problems.
At the Maryland Longitudinal Data System Center, which is experimenting with creating synthetic datasets, Executive Director Ross Goldstein says there is value in using synthetic data to help protect private or sensitive information.
The center has created a proof of concept using synthetic data to train AI on educational statistics without having to access real student information. This also holds promise for other fields, such as training AI to serve citizens.
“Synthetic data could be a useful tool to provide access to data, without the risk of disclosure or misuse of the real data,” Goldstein says.
This in turn could accelerate the adoption of AI, says Kalyan Veeramachaneni, a senior researcher at MIT's Laboratory for Information and Decision Systems.
In AI development, “state and local governments often end up hiring third-party software consulting firms” to help develop and test applications, he says. Agencies can safely give these partners access to synthetic data “because it’s not tied to any real person. It’s not tied to any particular real data.”
Synthetic data could also help state and local agencies train AI in situations where real-world data is scarce or difficult to obtain. It can help replace outdated information or fill in gaps where information is lacking, “reducing the burden of obtaining real-world data,” according to a Gartner study.
“Many local and state governments are asking citizens to volunteer their data to be used to develop AI models,” Veeramachaneni says. “Maybe a hundred of us volunteered our data, but most of us didn’t. Synthetic data is a way to enrich that data so that AI algorithms can extract models from it.”
How are state and local governments using synthetic data?
Recent examples show the potential of this field.
At the Maryland Longitudinal Data System Center, for example, researchers from several universities have “collaborated on a project to determine the feasibility of creating synthetic data from longitudinal data related to education,” Goldstein explains.
“The researchers were able to create three synthetic datasets and show that they accurately represented real-world data and did not create any risk of disclosing personally identifiable information about Maryland students or workers,” he says. The findings could help policymakers and other stakeholders gain the insights needed to improve educational outcomes.
In another recent example, the Urban Institute has joined forces with the Allegheny County Department of Human Services and the Western Pennsylvania Regional Data Center to drive the generation of synthetic data at the local level. The aim is to improve care coordination and drive operational improvements across a range of social services.
Local government agencies can make extensive use of synthetic data, Veeramachaneni says. In electrical grid management, for example, an AI trained on synthetic data could help predict failures.
“When you give it enough training data (for example, a past event that happened at a specific time in a transformer) and you have all the data leading up to that event, the AI model automatically learns what kinds of patterns led to that event,” he explains. “Once you can create synthetic data and provide it alongside real data, it can help create more accurate models.”
Overall, synthetic data “could be used by state and local governments to train AI in a variety of applications and services such as urban planning, public safety, emergency management, pandemic prevention and air quality monitoring,” says Houbing Herbert Song, an IEEE member.
What are the types of synthetic data?
Synthetic data can take several forms.
Amazon Web Services, for example, describes two main types of synthetic data: partial and full. Partial synthetic data replaces a small portion of a real-world dataset with synthetic values and can be used to protect sensitive information within that larger dataset. Full synthetic data, by comparison, contains no real-world data. This data can be used when there is insufficient data available to accurately train AI.
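The difference between the two types can be sketched in a few lines. This is an illustration only, not AWS code: the records are invented, the function names are hypothetical, and the "fully synthetic" generator samples each column independently, which is far simpler than what a production tool would do.

```python
import random
import statistics

# Invented records standing in for a real dataset.
REAL = [
    {"name": "Student A", "district": "North", "score": 81},
    {"name": "Student B", "district": "South", "score": 67},
    {"name": "Student C", "district": "North", "score": 92},
    {"name": "Student D", "district": "South", "score": 74},
]


def partially_synthetic(records, sensitive_fields):
    """Keep the real rows but overwrite only the sensitive fields."""
    out = []
    for i, rec in enumerate(records):
        new = dict(rec)  # copy so the original records stay untouched
        for field in sensitive_fields:
            new[field] = f"synthetic-{field}-{i}"
        out.append(new)
    return out


def fully_synthetic(records, n, rng):
    """Generate n new rows; no value is tied to a specific real row."""
    districts = [r["district"] for r in records]
    scores = [r["score"] for r in records]
    mean, spread = statistics.fmean(scores), statistics.pstdev(scores)
    return [
        {
            "name": f"synthetic-name-{i}",
            "district": rng.choice(districts),        # follows observed frequencies
            "score": round(rng.gauss(mean, spread)),  # follows observed mean/spread
        }
        for i in range(n)
    ]


partial = partially_synthetic(REAL, ["name"])
full = fully_synthetic(REAL, 20, random.Random(0))

print(partial[0])
print(full[0])
```

In the partial version the scores and districts are still the real ones, so only the overwritten field is protected. In the full version every row is new, but because this sketch samples columns independently it would lose any relationship between district and score; preserving such relationships is the job of a proper generative model.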
Synthetic data types can also be defined per use case.
“You have synthetic linguistic data, where you learn from a vast linguistic corpus and it can generate English sentences,” Veeramachaneni says. “There’s also synthetic multimedia data: images, audio, video.”
“The third type of synthetic data is tabular data, which many state and local agencies have. Examples include time-stamped voltage data on a power line, occupancy data in different residential or commercial complexes, or data on permits issued,” he explains.
“This tabular data is very complex because it contains many different data sources, different data tables connected in many different ways, and these interconnections are where all the patterns come from,” he adds. “In synthetic data, we can reproduce all of these properties and all of these patterns.”
What is the impact of synthetic data in a data management strategy?
New forms of data will inevitably impact how state and local agencies manage and store their information resources.
“Synthetic data is transforming the way data is managed, just as the Internet transformed the way data is transmitted,” Song says.
As part of their data management strategies, some IT teams are creating synthetic data platforms, “platforms that allow them to create a database containing synthetic data,” Veeramachaneni says.
The main goal is to clearly identify and track synthetic data to differentiate it from real-world data.
“Synthetic data looks like real data,” Veeramachaneni says. “We need to mark it so users know which ones are real databases and which ones are synthetic data. When users perform analytics or use it for downstream applications, they need to know they’re accessing synthetic data, not real data.”
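One way to picture this kind of marking is a small catalog that records, for every dataset, whether it is real or synthetic and what it was derived from. The sketch below is hypothetical: the class names, the origin labels and the dataset names are assumptions made for illustration, not a description of any platform mentioned in this article.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    origin: str                          # "real" or "synthetic"
    derived_from: Optional[str] = None   # source dataset, for synthetic entries

    def __post_init__(self):
        if self.origin not in ("real", "synthetic"):
            raise ValueError(f"unknown origin: {self.origin!r}")
        if self.origin == "synthetic" and self.derived_from is None:
            raise ValueError("a synthetic dataset must name its source")


class Catalog:
    """Hypothetical registry that keeps the real/synthetic label with the data."""

    def __init__(self):
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def describe(self, name: str) -> str:
        entry = self._entries[name]
        if entry.origin == "synthetic":
            return f"{entry.name}: SYNTHETIC (derived from {entry.derived_from})"
        return f"{entry.name}: real"

    def require_real(self, name: str) -> DatasetEntry:
        """Guard for uses, such as official reporting, that must not run on synthetic data."""
        entry = self._entries[name]
        if entry.origin != "real":
            raise PermissionError(f"{name} is synthetic, not real data")
        return entry


catalog = Catalog()
catalog.register(DatasetEntry("permits_2023", "real"))
catalog.register(DatasetEntry("permits_2023_synth", "synthetic", derived_from="permits_2023"))

print(catalog.describe("permits_2023"))
print(catalog.describe("permits_2023_synth"))
```

Keeping the label in the catalog entry, rather than only in a file name, means analytics and downstream applications can check it programmatically before they run.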
With robust data management strategies in place, agencies will be able to take full advantage of synthetic data. They can supplement real-world datasets where data is insufficient and protect citizen and employee privacy while training AI models that support better citizen services and operational efficiency.