Revolutionizing Data Labeling: Insights from Devang Sachdev of Snorkel AI
Snorkel AI: Streamlining the Data Labeling Process
Correctly labeling training data is essential, because a model learns whatever errors and biases its labels contain. However, manually labeling vast datasets is time-consuming and arduous. Relying on pre-labeled datasets poses its own challenges, as shown when MIT withdrew its 80 Million Tiny Images dataset because it contained numerous racist and misogynistic labels that could mislead AI models.
We recently spoke with Devang Sachdev, VP of Marketing at Snorkel AI, about how the company is tackling the cumbersome data labeling process effectively and safely.
AN: How is Snorkel enhancing the data labeling process?
Devang Sachdev explained: “Snorkel Flow transforms the traditional manual training data labeling into a programmatic approach that we’ve demonstrated can accelerate the creation of training data by 10 to 100 times. Users can capture their knowledge and available resources—both internal, like ontologies, and external, such as foundation models—into labeling functions applied at scale.”
He continued, “Unlike conventional, rules-based methods that may lack coverage or conflict with one another, Snorkel Flow employs theoretically grounded weak supervision techniques to intelligently combine labeling functions and auto-label your dataset using an optimal model. With this initial dataset, users can easily train a larger machine learning model from our ‘Model Zoo.’ This step makes it possible to:
- Generalize beyond the output of the model used for auto-labeling.
- Conduct model-guided error analysis to pinpoint where the model is confused, with auto-generated suggestions and tools for data exploration and tagging for further refinement.
“This rapid, iterative, and flexible method resembles software development more than the tedious, manual labor traditionally associated with data labeling. Much like software, it allows users to inspect and adapt the code that generates training data labels.”
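To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel Flow or its SDK: the spam-filtering task, the function names, and the keyword rules are all hypothetical, and a simple majority vote stands in for the weak supervision techniques Sachdev describes, which are more sophisticated than counting votes.

```python
from collections import Counter

# Label values for a hypothetical spam-filtering task (illustrative only).
ABSTAIN = -1
HAM = 0
SPAM = 1


def lf_contains_link(text):
    """Vote SPAM when the message contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM when the message mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_greeting(text):
    """Vote HAM for short messages that open with a greeting."""
    words = text.lower().split()
    if 0 < len(words) <= 4 and words[0] in {"hi", "hello", "thanks"}:
        return HAM
    return ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]


def majority_vote(text):
    """Combine labeling-function votes; abstain on ties or when no function fires.

    Majority voting is a simplified stand-in for a learned label model.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no clear winner
    return ranked[0][0]
```

Because each rule is ordinary code, it can be read, edited, and rerun over the whole dataset, which is the software-like quality the answer above points to.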
AN: Are there risks in over-automating the labeling process?
According to Sachdev, “The labeling process itself carries inherent risks due to human fallibility. Factors such as fatigue, errors, and underlying biases can inadvertently be encoded into the model through manual labeling. When inaccuracies or biases arise, the model may amplify these issues, causing detrimental outcomes across various applications, from lending inequalities to hiring discrimination and missed medical diagnoses.”
He added, “Moreover, there are practical dangers in over-automating and removing the human element from training data development. Training data embodies human expertise, and while some instances may not demand specialized input for labeling, these instances are exceptions rather than the rule. For effective training data, it’s crucial to encapsulate the complete knowledge of subject matter experts and the diverse resources they utilize for decision-making. Engaging highly sought-after experts for manual labeling is not scalable and results in untapped value.”
To address these issues, a programmatic approach to data labeling should be adopted, shifting focus from model-centric to data-centric AI workflows. This involves:
- Allowing domain experts to translate their expertise into scalable labeling processes rather than performing tedious one-by-one annotations.
- Implementing weak supervision to efficiently auto-label data at scale.
The automation in this approach is inherently transparent and supported by theoretical foundations. Every training data label applied this way can be inspected to see why it was assigned. Bringing experts into the AI development process helps teams iterate and troubleshoot more effectively. Streamlined workflows within the Snorkel Flow platform let data scientists collaborate with subject matter experts to identify the root causes of error modes and fix them through straightforward updates or corrections to the labels that error analysis reveals as inaccurate.
AN: How simple is it to identify and update labels in response to real-world changes?
DS: A key advantage of Snorkel Flow’s data-centric approach to AI development is its adaptability. Real-world changes, such as production data drift or evolving business objectives, are unavoidable. Snorkel Flow employs programmatic labeling, allowing for efficient responses to these changes. By contrast, traditional methods require relabeling an entire training dataset—which could involve thousands to hundreds of thousands of data points—whenever there’s a shift in objectives. This process could take weeks or even months before addressing the new requirements.
With Snorkel Flow, updating a schema is straightforward: users write a few additional labeling functions to cover the new classes, then apply weak supervision to combine all labeling functions and retrain the model. To monitor for data drift in production, one can use existing monitoring systems or Snorkel Flow’s production APIs to feed live data back into the platform and evaluate model performance on current data. When performance issues arise, the same analytical workflow applies: discover patterns through error analysis, use the suggested actions, and iterate with subject matter experts to refine and expand the labeling functions.
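As a rough illustration of evaluating a model against current data, the sketch below compares accuracy on a recent, labeled batch of production data with a stored baseline and flags the model for review when the drop exceeds a tolerance. It is not based on Snorkel Flow’s production APIs; the function names and the 0.05 default threshold are assumptions chosen for the example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def needs_review(baseline_accuracy, live_predictions, live_labels, max_drop=0.05):
    """Return (flag, live_accuracy).

    flag is True when accuracy on live data has fallen more than max_drop
    below the baseline. max_drop is a hypothetical tolerance, not a
    recommended value.
    """
    live_accuracy = accuracy(live_predictions, live_labels)
    return (baseline_accuracy - live_accuracy) > max_drop, live_accuracy
```

A flagged batch would then feed the error analysis loop described above, where new or revised labeling functions address the failure pattern.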
AN: MIT had to withdraw its ‘80 Million Tiny Images’ dataset after discovering it contained racist and misogynistic labels, a result of using an automated collection method based on WordNet. How does Snorkel ensure it avoids such labeling issues that can lead to harmful biases in AI systems?
DS: Bias can emerge at any stage in the process, whether in pre-processing, post-processing, task design, or modeling choices, and particularly in the labeled training data. Understanding the underlying bias requires insight into the rationale applied by labelers, which is often unfeasible when every data point is manually labeled without documenting the reasoning behind each decision. Moreover, information about who authored the labels, along with dataset versioning, is typically lacking. Labeling might be outsourced, or in-house labelers may have moved on to other projects.
Snorkel AI’s programmatic labeling approach helps uncover, manage, and address bias. Rather than discarding the rationale behind each labeled data point, Snorkel Flow captures the knowledge of labelers (subject matter experts, data scientists, or other professionals) as labeling functions. It then generates probabilistic labels using theoretically grounded algorithms encoded in a novel label model. With Snorkel Flow, users can see why a specific data point was labeled the way it was. Together with version control for labeling functions and datasets, this lets users audit, interpret, and explain model behavior, which is what makes the shift from manual to programmatic labeling significant for managing bias.
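The following sketch shows one simple way a probabilistic label can carry its own audit trail. It is not Snorkel Flow’s label model: it assumes labeling functions err independently and takes their accuracies as hand-supplied inputs, and every name in it is hypothetical.

```python
ABSTAIN = -1


def probabilistic_label(votes, accuracies, num_classes=2):
    """Turn labeling-function votes into class probabilities plus provenance.

    votes:      dict mapping labeling-function name -> class index or ABSTAIN
    accuracies: dict mapping labeling-function name -> assumed accuracy in (0, 1)

    Assumes independent labeling functions whose wrong votes are spread
    evenly over the other classes (a naive-Bayes-style simplification).
    """
    scores = [1.0] * num_classes
    for name, vote in votes.items():
        if vote == ABSTAIN:
            continue
        acc = accuracies[name]
        if not 0.0 < acc < 1.0:
            raise ValueError(f"accuracy for {name} must be strictly between 0 and 1")
        for label in range(num_classes):
            if vote == label:
                scores[label] *= acc
            else:
                scores[label] *= (1.0 - acc) / (num_classes - 1)
    total = sum(scores)
    probabilities = [s / total for s in scores]
    # Provenance: which functions voted, and for what, so the label can be audited.
    provenance = {name: vote for name, vote in votes.items() if vote != ABSTAIN}
    return probabilities, provenance
```

Keeping the provenance next to the probabilities is what allows a reviewer to trace a questionable label back to the specific rules that produced it.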
AN: A team led by Snorkel researcher Stephen Bach recently published a paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG). While I’d encourage readers to consult the paper for comprehensive information, could you provide a brief overview of its significance and advantages over existing WordNet-based methods?
DS: ZSL-KG enhances graph-based zero-shot learning in two fundamental aspects: richer models and richer datasets. On the modeling front, ZSL-KG employs a novel type of graph neural network known as a transformer graph convolutional network (TrGCN). Traditional graph neural networks typically represent nodes through linear combinations of neighboring representations, which can be limiting. In contrast, TrGCN uses small transformers at each node to combine neighborhood representations in more expressive ways.
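The contrast between the two aggregation styles can be sketched in a few lines of plain Python. This is a reduced illustration, not the TrGCN from the paper: it uses a single round of scaled dot-product self-attention over a node’s neighbors with no learned projections, no feed-forward sublayer, and no multiple heads, and the final mean pooling step is an assumption made for the example.

```python
import math


def mean_aggregate(neighbors):
    """Linear aggregation: average the neighbor feature vectors."""
    count = len(neighbors)
    return [sum(column) / count for column in zip(*neighbors)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(neighbors):
    """Non-linear aggregation: self-attention across neighbors, then mean pooling.

    Each neighbor attends to every neighbor, so the mixing weights depend on
    the features themselves rather than being fixed in advance.
    """
    dim = len(neighbors[0])
    attended = []
    for query in neighbors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in neighbors
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, neighbors)) for j in range(dim)]
        )
    return mean_aggregate(attended)
```

With mean aggregation every neighbor contributes equally, whereas the attention variant weights neighbors by how similar their features are, which is one sense in which the combination becomes more expressive.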
On the data side, ZSL-KG utilizes common sense knowledge graphs that incorporate natural language and graph structures to explicitly represent various types of relationships between concepts. This method provides a significantly richer data source than the conventional ImageNet subtype hierarchy.
AN: Gartner recognized Snorkel as a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you believe distinguishes you from your competitors?
DS: Data labeling is one of the most significant challenges in enterprise AI. Many organizations acknowledge that current methods do not scale and are frequently plagued by quality, explainability, and adaptability issues. Snorkel AI not only automates data labeling but also offers a development platform that takes a data-centric approach while harnessing knowledge resources from subject matter experts and existing systems.
Additionally, Snorkel AI brings together over seven years of research and development—initially established at the Stanford AI Lab—alongside a talented team of machine learning engineers, success managers, and researchers. This collective expertise aids in customer development while driving market innovations. Snorkel Flow integrates all necessary elements of a programmatic, data-centric AI development workflow—covering training data creation and management, model iteration, error analysis tools, and data/application export or deployment. Each stage of the platform is fully interoperable through a Python SDK and various connectors.
This unified platform provides an intuitive interface and efficient workflows, fostering critical collaboration between annotators, data scientists, and other roles to speed up AI development. It lets teams iterate on both data and models within a single platform, using insights from one to inform the development of the other, which shortens development cycles.
The Snorkel AI team will be sharing their valuable insights at this year’s AI & Big Data Expo North America. Be sure to check it out and visit Snorkel’s booth at stand #52.
About AI News and Our Commitment
AI News is part of TechForge, dedicated to providing insightful and informative content about the rapidly evolving world of artificial intelligence. Our goal is to keep you updated on the innovations and ethical considerations that shape the industry.