Remove bad records in grscicoll #535

Open
MortenHofft opened this issue Jun 15, 2023 · 0 comments
MortenHofft commented Jun 15, 2023

We have never done much to remove bad records. For the sake of first impressions, it might be worth doing a bit of that.

In an attempt to do so, I've taken all institutions created before 2018, when we started managing the data.
I removed everything that has a master source (I only remembered that midway through), as well as everything that includes words like university, national park and some other stop words.

The remaining institution names were given to GPT4, which was asked to detect outliers and to rather include too many than too few.
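For anyone wanting to repeat the pre-filtering step, here is a minimal sketch of how it could look. It is not the script that was actually used: the field names `name`, `created` and `master_source`, the `candidate_names` helper, and the assumption that `created` is a plain ISO date without a timezone are all hypothetical, and the stop-word list only contains the two words named above.

```python
from datetime import datetime

# Only the two stop words named in the issue; the full list is not given there.
STOP_WORDS = ("university", "national park")
# Assumed cutoff: records created before 2018 predate active management.
CUTOFF = datetime(2018, 1, 1)


def candidate_names(institutions, stop_words=STOP_WORDS, cutoff=CUTOFF):
    """Return names of pre-cutoff institutions with no master source and no stop word."""
    names = []
    for inst in institutions:
        # Hypothetical field: ISO date string such as "2015-03-01" (no timezone).
        created = datetime.fromisoformat(inst["created"])
        if created >= cutoff:
            continue  # created after we started managing the data
        if inst.get("master_source"):
            continue  # synced from a master source, so not ours to clean up
        lowered = inst["name"].lower()
        if any(word in lowered for word in stop_words):
            continue  # name looks like a plausible institution
        names.append(inst["name"])
    return names


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        {"name": "Kochi Senior High School", "created": "2015-03-01", "master_source": None},
        {"name": "Example University", "created": "2015-03-01", "master_source": None},
    ]
    for name in candidate_names(sample):
        print(name)
```

The resulting names are what would then be passed on for outlier detection. Matching stop words as lowercase substrings is a deliberately crude choice; it errs on the side of keeping anything that looks like a real institution out of the candidate list.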

GPT4 response
I see a mix of different types of organizations in your list, including universities, museums, botanical gardens, research institutions, and other organizations related to science and biodiversity. From your list, these are some of the entities that might be considered as outliers, although this does not mean they are irrelevant. It entirely depends on the context and purpose of your list.

  • "Industry secrets Designed for voice transcription jobs philippines Exposed": this entry seems unrelated to the list and looks like a misplaced phrase.
  • "See FMRI": this could be a note to reference Functional Magnetic Resonance Imaging (fMRI) and doesn't fit the list format.
  • "DominicusCirillus[deceased]": this looks like the name of an individual, possibly deceased, and not an organization or institution.
  • De La Villa, Francisco: this seems to be a personal name, rather than an institution or organization.
  • Francois Genier: this appears to be a personal name, not an institution or organization.
  • Neal L. Evenhuis: another personal name.
  • Global Colosseum: this seems more like a venue or event place, rather than a scientific or research institute.
  • Principality of Liechtenstein: this is a country, not an institution or organization.
  • Yamanouchi Pharmaceutical Co., Ltd.: while a legitimate organization, this is a pharmaceutical company which may not fit with the theme of research institutes, museums, or nature parks.
  • Middle School of the Third Factory Machinery: appears to be a school, which while an educational institution, may not fit in the context of the other institutions listed.
  • Timpanogos Cave National Monument: These are specific national monuments or natural sites, rather than organizations or institutions, but they could still be relevant depending on the context of your list.
  • Thierry Neef de Sainval: another personal name.
  • Sammlung Oberli: this could be referring to a private collection, and not an institution or organization.
  • Kochi Senior High School: This entry is a high school and does not fit into the mentioned categories.
  • Matuyama Girl's High School: This entry is also a high school and does not belong in the categories.
  • Nossa Senhora do Patrocinia: This name seems to refer to a Catholic patroness, not a scientific or educational institution.
  • Reservoir of Heilongtan: This seems to be a geographic location or a water body, not an institution.
  • Ontario Hydro: This is a power generation company, not an organization in your stated categories.
  • Death Valley National Monument: This is a national monument rather than a nature park or educational/scientific institution.
  • The Crown Prince's Palace: This seems to be a historic place or a royal residence, not a scientific or educational institution.
  • Verticillium dahliae from cotton: This seems to be a reference to a type of fungal pathogen in cotton, not an institution.
  • Joshua Tree National Monument: This is a national monument rather than a nature park or educational/scientific institution.
  • Marwa mokhtar Abd Rabo: This appears to be an individual's name, not an organization.
  • N.P.P: Without further context, this abbreviation could refer to numerous things, but it does not clearly refer to an organization fitting your categories.
  • Tree Nursery: A tree nursery might not fit into the categories provided as it is more related to agriculture and horticulture.
  • Fort Hays: This could be referring to a historic site or a university. If it's the former, it may not fit in the categories.
  • AGRITEC, Ltd.: This seems to be a private company, and might not fit in the categories.
  • Philip Harris Biological Ltd.: This is a company that provides equipment for science education.
  • FCT: This abbreviation could stand for a number of things, but without further information, it's hard to tell if it fits in your categories.
  • Division of Standardisation: Without more context, this could refer to various entities, but it's not clear that it belongs to your categories.
  • CAIRCC: This abbreviation could stand for a number of things, but without further context, it's unclear how it fits your categories.
  • ALGOBANK: This seems like it might be a bank or a database of some sort, but it's unclear how it fits your categories.
  • The North Carolina Memorial Hostital: This appears to be a hospital, which might not fit into your categories, also the spelling might be incorrect (should be Hospital).
  • DNA-TAX: Without more context, it's not clear what this refers to. It could be a database or tool related to genetic research.
  • Instituto Zimotecnico-Z: This doesn't provide enough information to clearly categorize it. Also, the word Zimotecnico does not seem to correspond to a known institution or term.
  • Sanofi ELF Biorecherches: This seems like it could be a pharmaceutical company or research division, which might not fit your categories.
  • James Cook Townsville: This could be referring to a city or a campus of James Cook University. If it's the former, it doesn't fit into your categories.
  • BPPT Ethanol-Single Cell Protein-Fructose Syrup Technical Unit: This seems more like a specific technical unit or project, rather than an organization.
  • ITG: It's an acronym that could stand for a variety of things, therefore, it would be good to verify this.
  • RRL , Jammu INDIA: It is not clear what this organization is, it could be a typo or an acronym.
  • R. M. Alden Research Lab: This appears to be a specific lab, not an institution or organization as such.
  • National Kurturensammlung fuer Mikroorganismen: It looks like a misspelling. It could be Kultursammlung.
  • Setor de Pesquisa Tecnologica de Sistemas de Tratamento de Efluentes Domesticos: This seems like a specific research area or sector within a larger institution, not an institution itself.
  • Aboyne Castle: This seems like a historical or tourist location, rather than an educational or research institution.
  • Literary and Philosophical Institution of Chatham: This may be more focused on literature and philosophy than natural sciences, depending on the scope of your list.
  • Atkins Jardín Botánico de Cienfuegos: Atkins is a famous botanic garden, but 'Atkins' seems to be a mistake here.
  • Alnwick Scientific and Mechanical Institution: The inclusion of 'mechanical' might make this institution less relevant to a list focused on natural sciences or biology.
  • Aix-Marseille Université: This is a comprehensive university and not specifically focused on natural sciences or biology.
  • All Saint's Church: This appears to be a religious institution, not an educational or research institution.
  • YMCA Hostel: This is a type of accommodation and doesn't seem to fit with the other institutions listed.
  • Auroville Foundation: This organization is related to a planned international city in India and may not be directly related to natural sciences or biology.
  • Beth Gordon Institute: This might be more focused on personal improvement and health, depending on the specific Beth Gordon Institute referred to.
  • Crispus Attucks High School: This is a high school and might not belong in a list with research institutes and universities unless you're also including secondary education.
  • Austin High School: Similar to the above, this is a high school.
  • Abitibi Paper Company: This is a company and might not be directly related to natural sciences or biology.
  • Whiteshell Nuclear Research Establishment: This institution is more related to nuclear research and might not fit in a list focused on natural sciences or biology.
  • County Record Office: This seems to be more related to administration and public records than scientific research or education.
  • Barton-on-Humber Grammar School: This is another school that might not belong in the list if you're focusing on research institutes and universities.
  • Willesden Borough Council: This is a governmental entity and might not belong on a list with research institutes and universities unless you're including governmental departments in general.
  • Shrewsbury School: This is another school that might not belong in the list if you're focusing on research institutes and universities.
  • San Joaquin County: This is a county and doesn't seem to fit with the other institutions listed.
  • Desert: Without further context, this entry is quite ambiguous. If it's supposed to be a nature park or a research
  • East High School: This seems to be a regular high school, not a university or research institute.
  • Turvey Abbey: While an abbey could potentially house a historical collection, if it doesn't, it might not fit the theme.
  • St. Paul's School: Similar to East High School, this appears to be a regular school.
  • St. Bartholomew's Hospital: A hospital doesn't usually fall under any of your categories unless it's associated with a university or has a significant museum.
  • St. Leonard's School: Another school that may not fit the theme.
  • Boys' High School: Unless this high school has a special collection or is associated with a university, it might not fit the theme.
  • School of Medicine for Women: This could be relevant if it's part of a university, but if it's a standalone school, it might not fit your categories.
  • Virginia Military Institute: It might be relevant if it's conducting scientific research or has a significant museum, but otherwise, it could be considered an outlier.
  • Stowlangtoft Hall: Without more context, it's unclear how this fits into your categories. If it's simply a residential hall, it might be an outlier.
  • Stoke-on-Trent Athenaeum: An Athenaeum could house a collection or be an institution of learning, but if it's primarily a social club or event venue, it might not fit your categories.
  • Tristan da Cunha: This appears to be a remote island, not an institution or organization.
  • Pitcairn: Another remote island that might not fit your categories unless there's a significant nature reserve or research station there.
  • Polytechnic of North London: It's unclear if this is a university or a secondary school. If it's the latter, it might not fit the theme.
  • Peper Harow: This appears to be a residential estate or small village, which might not fit your categories.
  • Plymouth Institution and Athenaeum: Like Stoke-on-Trent Athenaeum, it could be relevant if it houses a significant collection or is an institution of learning.
  • Oundle School: Appears to be a secondary school, which could be an outlier.
  • Ellesmere Chambers: Without more context, it's unclear how this fits into your categories.
  • Puslinch House: Seems to be a residential location, so it could be an outlier unless it houses a significant collection.
  • Paisley Philosophical Institute: Philosophical societies often have libraries or collections, but if this one doesn't, it might not fit your categories.
  • Lathallan Preparatory School: A preparatory school could be considered an outlier in your list.
  • Murray Royal Institution: This seems to be a psychiatric hospital, which could be an outlier unless it's associated with a university or has a significant museum.
  • Philosophical Institution of Newport: Similar to Paisley Philosophical Institute, if it doesn't house a collection, it might be an outlier.
  • Harrow School: Another school that could be an outlier.
  • Queen Ethelburga's School: Yet another school that might not fit your categories.
  • Hampshire Cultural Trust: This organization is broader than just natural sciences, biodiversity, or research as it promotes all kinds of cultural activities and heritage within Hampshire, UK.
  • Government of South Georgia: This is a governmental organization that has a broader purpose beyond just scientific research, collection, or conservation.
  • Foundation for Revitalisation of Local Health Traditions: This organization appears to focus on traditional health practices, which may or may not align with the rest of the entities on the list, depending on the context.
  • Forestry Commission: This is a government department in the UK responsible for the protection and expansion of forests and woodlands. Its scope may be broader or different from the other organizations on your list.
  • National Oceanic and Atmospheric Administration (NOAA): This is a scientific agency within the United States Department of Commerce. It focuses on the conditions of the oceans, major waterways, and the atmosphere, which might be broader than the other organizations on your list.
  • The Children's Hospital at Westmead: This seems to be a clear outlier. It is a hospital dedicated to pediatric healthcare and does not fit with the theme of universities, research institutes, and collections tied to biodiversity and natural sciences.
  • ILSbio, LLC Biobank: A private company providing biobanking services, which is different in nature from the mainly public or nonprofit entities on your list.
  • DOI/FWS, Kenai National Wildlife Refuge: This is a wildlife refuge managed by the U.S. Fish and Wildlife Service, which might have a more specific or narrower focus compared to some other organizations on your list.
  • Taxon Expeditions: This seems to be more of a tour or travel company, with a focus on scientific expeditions, rather than an institution, museum or collection.
  • Afribugs: From the name, it's difficult to discern whether this is an institution, scientific collection, museum, or other. It seems to imply a collection of some sort, but without further context, it may be worth checking.
  • HHS/NIH, National Institutes of Health: While this is a governmental department, its focus is health and medical research, so it may not align with the natural or scientific focus of the other entities.
  • National Institute of Child Health and Human Development: Similar to the above, while this is a governmental organization, it's not directly related to nature, ecology, or physical sciences.
  • Microbial Carbohydrate Resource Bank: This sounds like a biotechnological resource, which might not fit the 'nature parks, museums, and universities' context perfectly.
  • Lamont-Doherty Core Repository, Gulf Coast Repository, Louisiana Geological Survey Core Repository: These repositories likely consist of scientific collections, but their alignment with the rest might depend on the specific focus of your study or project.
  • U.S. Environmental Protection Agency: This is a governmental department with a focus on the environment, but it may or may not be directly relevant depending on the context.
  • The Bacteriophage Bank of Korea: This is a biological resource center, and its inclusion depends on whether biological collections of this sort are relevant to your criteria.

Remember that these are just potential outliers and it's always good to check their relevance depending on your specific context or project.
