Senior Software Engineer — AI Evaluation & Benchmarks (Python)
<h2><strong>Before Applying</strong></h2><p style="min-height:1.5em">This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying — we're unable to process applications from unlisted locations. <a target="_blank" rel="noopener noreferrer nofollow" class="underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current" href="https://docs.google.com/document/d/1FK0v1X3O3rqY0oB2k5xt0u5eiYaoYYKv_E4XS3kHXUs/edit?tab=t.0#heading=h.8jwvoue7ks7z">List of accepted countries and locations.</a></p><p style="min-height:1.5em">For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.</p><p style="min-height:1.5em"></p><h2><strong>What You'll Be Doing</strong></h2><p style="min-height:1.5em">Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:</p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code</p></li><li><p style="min-height:1.5em">Build and maintain scalable data pipelines for evaluation workflows</p></li><li><p style="min-height:1.5em">Analyze model-generated code for correctness, reliability, and edge-case failures</p></li><li><p style="min-height:1.5em">Construct structured evaluation scenarios across large repos and multi-language environments</p></li><li><p style="min-height:1.5em">Provide detailed technical feedback on model performance and failure patterns</p></li><li><p style="min-height:1.5em">Contribute to evaluation frameworks that set the bar for how coding ability is measured</p></li></ul><p 
style="min-height:1.5em">End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.</p><p style="min-height:1.5em">AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.</p><p style="min-height:1.5em"></p><h2><strong>What You'll Need</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em">4+ years of professional software engineering experience (non-negotiable)</p></li><li><p style="min-height:1.5em">Expert Python — clean, performant, well-tested code</p></li><li><p style="min-height:1.5em">Hands-on experience working in large, complex codebases</p></li><li><p style="min-height:1.5em">Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines</p></li><li><p style="min-height:1.5em">Strong command of Git and modern development workflows</p></li><li><p style="min-height:1.5em">Track record at a high-growth tech company or top-tier software organization</p></li><li><p style="min-height:1.5em">Strong written English communication</p></li></ul><p style="min-height:1.5em">Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.</p><p style="min-height:1.5em"></p><h2><strong>Nice to have</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Senior or Lead-level profile with a history of technical ownership</p></li><li><p style="min-height:1.5em">Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)</p></li><li><p style="min-height:1.5em">Proficiency in additional languages: JavaScript, Go, C++, or others</p></li><li><p style="min-height:1.5em">CI/CD experience and writing robust unit tests (pytest, 
Mocha, JUnit)</p></li><li><p style="min-height:1.5em">Background in security engineering or significant open-source contributions</p></li><li><p style="min-height:1.5em">Familiarity with AI/ML evaluation methodologies or model benchmarking</p></li></ul><p style="min-height:1.5em"></p><h2><strong>Logistics</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Location: Fully remote — work from anywhere on the accepted locations list</p></li><li><p style="min-height:1.5em">Compensation: $80–$100/hr based on location and seniority</p></li><li><p style="min-height:1.5em">Contract length: 3 months, with potential for extension</p></li><li><p style="min-height:1.5em">Hours: Full-time availability preferred — hours vary by project and are not guaranteed week to week</p></li><li><p style="min-height:1.5em">Engagement: 1099 independent contractor</p></li><li><p style="min-height:1.5em">Payment: Weekly via PayPal or Stripe</p></li></ul><p style="min-height:1.5em">⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.</p>