AIFARMS is one of the 27 AI Institutes funded by USDA-NIFA and the NSF. Its focus is on making foundational AI advances in agriculture and using them to ensure that future agriculture is environmentally friendly, sustainable, affordable, and accessible to diverse farming communities. Housed at the University of Illinois, AIFARMS has a wide network of collaborators that includes researchers from Michigan State University, the University of Chicago, Tuskegee University, Argonne National Laboratory, and the Donald Danforth Plant Science Center, among others.
From the very start, AIFARMS has treated data management as an important operation of the Institute. As a result, a Data Management working group was formed, with which Ana Lucic has been involved for the past three years. The group is tasked with keeping abreast of technological changes that influence how data is published, shared, interpreted, and analyzed. As part of this effort, the working group has created Best Practices for Dataset and Software creation that are available through this link. This guide serves as a reference tool for students, staff, and researchers as they go through the process of dataset or software creation. Each dataset creation process can be unique in nature and purpose, which is why the tutorial is general. We invite researchers to approach us with more specific questions about finding the right platform on which to post datasets, how to anonymize data before publishing it, the vocabulary to use to document the data, and how to make the data more findable, accessible, interoperable, and reusable (FAIR, in short).
Datasets take a long time to create. During this process, a subset of the data might be ready to share with other researchers but not yet with the public, which is why researchers need an easy way to share data internally. AIFARMS has created templates for internal data sharing as well as a data portal where data created by researchers can be hosted in part or in full, and either privately or publicly.
The pace of technology affects how we manage and engineer data. While large language models hold promise for processing or extracting meaning from data, a lack of documentation will likely hinder further enrichment of a data source or its alignment with other sources. In other words, there is quite a lot that researchers need to do to facilitate alignment with other data sources, which could lead to a stronger signal elicited from the data. One example dataset that Lucic has helped document is the PigLife dataset, which is available here. The publication[1] associated with this dataset is available here. Large language and computer vision models are also being used to develop new methods for identifying previously unseen and unlabeled categories of objects and turning them into useful information, as in the following recent publication[2] led by Garvita Allabadi, a computer science PhD student affiliated with AIFARMS.
Another crucial aspect of the data work at the Institute is collaboration with scientists from other AI Institutes, such as AgAID, AIFS, AIIRA, and ICICLE, among others. Participation in several working groups that deal with data engineering challenges and practices, such as the Data Engineering for AI applications working group, the IEEE Smart Ag working group, and the AgBioData Phenotypic data standardization and management working group, helps Lucic stay abreast of changes in the field. It is through these partnerships and active collaborations that decisions are made on how to navigate the complex data landscape in agriculture, which spans images from hyperspectral, multispectral, and LIDAR camera systems, video, physical and biological samples, and surveys, among others.
The search for better alignment and expression of existing data sources represents an important aspect of Lucic’s work on her other projects within the Algorithm & Software group at the Illinois Applied Research Institute (ARI). For example, for the Correlation of Mental Health Screening and Accelerated Growth Trajectories project, the team is retrospectively analyzing accelerated growth trajectories against a psychosocial functioning score, such as the one provided by the Pediatric Symptom Checklist-17 questionnaire, as well as against the Social Determinants of Health. In this instance, aligning data from different time points was a prerequisite for the analysis. Finding the right way to align the data is a challenge in itself, but also one of the fascinating aspects of the project. A different project looks at how we can elicit a stronger signal from data by trimming the peripheral content that typically surrounds textual objects, such as digitized texts that contain material related to, but peripheral to, the core content of the works.
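To make the idea of aligning data from different time points more concrete, here is a minimal sketch of one common approach: pairing each screening record with the nearest-in-time measurement for the same subject, within a tolerance. This is an illustration only and not the method used in the project. The field names (`subject`, `date`), the function names, and the 30-day default tolerance are all assumptions made for the example.

```python
from bisect import bisect_left
from datetime import date


def nearest_measurement(measure_dates, target, max_gap_days):
    """Return the index of the date in sorted `measure_dates` closest to `target`,
    or None if even the closest date is more than `max_gap_days` away."""
    if not measure_dates:
        return None
    pos = bisect_left(measure_dates, target)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(measure_dates)]
    best = min(candidates, key=lambda i: abs((measure_dates[i] - target).days))
    if abs((measure_dates[best] - target).days) > max_gap_days:
        return None
    return best


def align(screenings, measurements, max_gap_days=30):
    """Pair each screening with the nearest-in-time measurement for the same subject.

    Both inputs are lists of dicts with hypothetical keys "subject" and "date".
    Screenings with no measurement inside the tolerance get "matched": None.
    """
    by_subject = {}
    for m in sorted(measurements, key=lambda m: m["date"]):
        by_subject.setdefault(m["subject"], []).append(m)

    aligned = []
    for s in screenings:
        rows = by_subject.get(s["subject"], [])
        idx = nearest_measurement([r["date"] for r in rows], s["date"], max_gap_days)
        aligned.append({**s, "matched": rows[idx] if idx is not None else None})
    return aligned
```

Keeping unmatched records in the output, rather than silently dropping them, makes it easier to see how much of the data could not be aligned under a given tolerance.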
Throughout her work at ARI, Lucic has had the opportunity to interact with small, medium, and large datasets and to observe first-hand the challenges these sources continue to pose, but also to explore the opportunities that the latest technological inventions can bring about, such as automatic data enrichment and analysis, the creation of machine-readable datasheets and data management plans, and the creation of datasets that are overall more analysis-ready and more FAIR.
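As a rough illustration of what a machine-readable datasheet could look like, the sketch below stores a few descriptive fields in a structured record, checks that the required ones are filled in, and serializes the result to JSON. The field list, the `Datasheet` class, and the choice of required fields are assumptions for the example; they are not the AIFARMS template or any published datasheet standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    """Hypothetical minimal datasheet; real templates carry many more fields."""
    title: str
    description: str
    creators: list
    license: str
    keywords: list = field(default_factory=list)
    file_formats: list = field(default_factory=list)


# Fields this sketch treats as mandatory (an assumption, not a standard).
REQUIRED = ("title", "description", "creators", "license")


def missing_fields(sheet):
    """Return the names of required fields that are empty."""
    record = asdict(sheet)
    return [name for name in REQUIRED if not record[name]]


def to_json(sheet):
    """Serialize the datasheet to a JSON string with stable key order."""
    return json.dumps(asdict(sheet), indent=2, sort_keys=True)
```

Because the datasheet is plain structured data, a script can flag incomplete documentation before a dataset is shared, and the same JSON can be read by other tools later.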
[1] Li, J., Hu, X., Lucic, A., Wu, Y., Condotta, I. C. F. S., Dilger, R. N., Ahuja, N., & Green-Miller, A. R. (2024). Promote computer vision applications in pig farming scenarios: high-quality dataset, fundamental models, and comparable performance. Journal of Integrative Agriculture. https://doi.org/10.1016/j.jia.2024.08.014
[2] Allabadi, G., Lucic, A., Wang, YX. et al. Learning to Detect Novel Species with SAM in the Wild. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02234-0