05-23-2025, 08:20 AM | #701
Dances with Wolves
Join Date: Jun 2006
Location: Section 304
Several key happenings in the space:
Ethan Mollick provides an update on a study run at Harvard, Stanford, and others that looked at patient assessment by physicians vs. AI. They've updated the study to use o1 Preview, and the AI is beginning to pull away. Also worth noting that studies move quite slowly, and we already have access to both o3 and what-will-become o4. (Note: there was no o2, thanks to the European telecom O2.)
https://x.com/emollick/status/1925362565946786206
It also appears that AI plus an instructor is giving children 1.5-2 years of advancement in only a few months; however, students who were unfamiliar with LLMs are more prone to use it as a crutch, damaging their learning.
https://x.com/emollick/status/1925055450254385592
In what might be the coolest thing I've seen in some time, this group used AI to create an environmentally friendly coolant.
https://x.com/vitrupo/status/1924568771353841999
The Following User Says Thank You to Russic For This Useful Post:
05-23-2025, 08:36 AM | #702
Franchise Player
Join Date: Feb 2011
Location: Somewhere down the crazy river.
Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.
The Following 2 Users Say Thank You to Wormius For This Useful Post:
05-23-2025, 09:11 AM | #703
Dances with Wolves
Join Date: Jun 2006
Location: Section 304
Quote:
Originally Posted by Wormius
Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.
The more things change...
The advertising implications of these tools are off the charts. I can assure you that within no time at all you'll happen upon a webpage dynamically created to address your ultra-specific pain points. Whether that's more annoying than what we already have remains to be seen, I suppose.
05-23-2025, 09:29 AM | #704
Franchise Player
Join Date: May 2004
Location: Helsinki, Finland
Quote:
Originally Posted by Wormius
Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.
Advertising will be the least of it.
They will become a feeding tube of opinions and worldviews very quickly.
So far mostly unintentionally. So far.
06-09-2025, 10:28 PM | #705
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
The Following 5 Users Say Thank You to Fuzz For This Useful Post:
06-09-2025, 11:34 PM | #706
Franchise Player
Join Date: Feb 2011
Location: Somewhere down the crazy river.
I asked Copilot a question with some very specific parameters. It made up parameters that were not only incorrect, but didn't even match the ones in the spec sheet it said it sourced. I swore at it a bit and it apologized. I told it to redo the search and not lie again, and guess what!? It lied again and returned the same information with the made-up numbers.
I have no faith in this. It fails me every time I use it.
06-10-2025, 11:53 AM | #707
Dances with Wolves
Join Date: Jun 2006
Location: Section 304
The gap between what Google should be able to do and what they regularly end up doing has got to be the biggest in the game. It's a bit odd given that they're Google, yet OpenAI routinely beats them at stuff like this. I suppose because it's free and available to everyone the model has to suck?
It's becoming clear that (for a while at least) there will be a massive difference between those who pay for the good models and those who don't (or can't).
06-10-2025, 12:05 PM | #708
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
You've got me curious, since I don't pay. If you or anyone does, can you put this question to ChatGPT: "Does Cape Breton have its own timezone?"
06-10-2025, 12:08 PM | #709
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
Never mind, chatgpt-4o-latest-20250326 gets it right. Same with grok-3-preview-02-24 AND gemini-2.5-flash-preview-05-20.
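If anyone wants to put the same question to a model through the API instead of a chat UI, here's a rough sketch using the OpenAI Python client. Treat "gpt-4o" as a placeholder for whatever model your plan actually includes.
Code:
# Ask one factual question through the OpenAI chat API.
# Assumes OPENAI_API_KEY is set in the environment; the model name
# is a placeholder for whatever you have access to.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Does Cape Breton have its own timezone?"},
    ],
)
print(response.choices[0].message.content)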
06-10-2025, 12:35 PM | #710
Franchise Player
Join Date: Feb 2011
Location: Somewhere down the crazy river.
Quote:
Originally Posted by Russic
The gap between what Google should be able to do and what they regularly end up doing has got to be the biggest in the game. It's a bit odd given that they're Google, yet OpenAI routinely beats them at stuff like this. I suppose because it's free and available to everyone the model has to suck?
It's becoming clear that (for a while at least) there will be a massive difference between those who pay for the good models and those who don't (or can't).
Also, if you're in a technical role where there isn't a lot of training data available (not everyone works in an open-source-friendly tech industry, and the publications that carry the information aren't free to access), the results are always horrible hallucinations. Yet managers push AI as this magic productivity booster…
06-10-2025, 02:29 PM | #711
Franchise Player
Join Date: Jul 2010
Location: Calgary - Centre West
Quote:
Originally Posted by Wormius
I asked Copilot a question with some very specific parameters. It made up parameters that were not only incorrect, but didn’t even match the ones in the spec sheet it says it sourced. I swear at it a bit and it apologizes. I tell it to re-do the search but don’t lie again, and guess what!? It lies again and returned the same information with the made up numbers.
I have no faith in this. It fails me every time I use it.
Copilot has been good for the softball tasks I've thrown at it (particularly since it can leverage corporate documentation).
However, I personally pay for ChatGPT Plus, and it just did the same thing to me as it sounds like it did to you. I fed it three documents to use as the foundational basis for reviewing a fourth document, and it started making up clauses in the fourth document and flagging them as violations. I would insist that these clauses didn't exist, it would apologize, and then it would do it again.
I finally got sick of it making things up, started a new chat (and deleted the old one), and wrote some rules for it to follow whenever performing document analysis, since I've found it seems to do well when given tight guardrails:
1. Strict Clause Verification Rule: Only reference portions of text or clauses after directly locating them in the document through confirmable visible reading — no assumptions or projections.
2. Annotated Mode by Default: Provide exact paragraph, section, and page (where available) before offering any interpretation.
3. Reset-on-Upload Discipline: When the user instructs to forget a previously uploaded document, perform a full document context hard reset to prevent carryover errors.
4. Source Quotation Integrity Rule: Any interpretation must include the original quoted text and clarify if the interpretation is verbatim or inferred.
5. Chain-of-Reasoning Transparency: All conclusions must include a step-by-step justification.
6. Document Chain Anchoring: All citations and findings must trace back to the specific document and section.
7. Disclose Assumption Thresholds: If an assumption is made, explicitly flag it with a certainty rating and offer alternatives.
8. "No Silent Fixes" Policy: Never correct or smooth over errors silently; highlight issues explicitly and offer options.
9. Double-Pass Reviews: First pass is issue-flagging with exact quotes; second pass is interpretation only.
10. Deliberate Obstruction Checks: Evaluate how clauses might be challenged or weakened under dispute or scrutiny.
11. "What's Missing" Prompt Layer: Identify standard clauses or disclosures that are notably absent.
12. Comparative Clause Mapping: Where applicable, match clauses line-for-line across documents to reveal gaps or discrepancies.
Then I provided the foundational documents and instructed it to learn them, then provided the fourth document for it to find where clauses in the fourth document violated provisions set forth in the first three.
ChatGPT proceeded to make up sections in the fourth document for its references once more. So I started from scratch AGAIN and provided the foundational documents, but this time I copied and pasted only specific portions from the fourth document for cross-checking against the first three, in case an OCR issue in the review document was causing problems. Nope: I checked its references against the foundational documents and found it was making things up from those PDFs, too.
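As an aside for anyone trying this through the API instead of the web app: guardrails like these can be pinned as a system message so they apply to every turn instead of competing with the chat history. A rough sketch with the OpenAI Python client; the condensed rules and model choice here are illustrative, not a tested recipe.
Code:
# Sketch: pin condensed document-analysis guardrails as a system message
# so they persist across turns. Rules and model are illustrative only.
from openai import OpenAI

client = OpenAI()

GUARDRAILS = """You are reviewing documents under strict rules:
1. Quote a clause only after locating it verbatim in the supplied text.
2. Cite the exact section and paragraph before interpreting anything.
3. Flag every assumption explicitly, with a certainty rating.
4. Never invent, correct, or smooth over text that is not present."""

def review(foundational_docs: list[str], review_doc: str) -> str:
    """Check review_doc against the foundational documents."""
    context = "\n\n---\n\n".join(foundational_docs)
    response = client.chat.completions.create(
        model="gpt-4.1",  # large context window for long documents
        messages=[
            {"role": "system", "content": GUARDRAILS},
            {"role": "user", "content": (
                "Foundational documents:\n" + context
                + "\n\nDocument under review:\n" + review_doc
                + "\n\nList every clause in the document under review that"
                " violates the foundational documents, quoting each clause"
                " verbatim with its section number."
            )},
        ],
    )
    return response.choices[0].message.content
A side benefit: a "hard reset" is then just starting a new messages list, so nothing carries over from a previous document.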
__________________
-James
GO FLAMES GO.
Quote:
Originally Posted by Azure
Typical dumb take.
The Following 3 Users Say Thank You to TorqueDog For This Useful Post:
06-10-2025, 02:36 PM | #712
The new goggles also do nothing.
Join Date: Oct 2001
Location: Calgary
I will say MS Copilot is creepily good at suggesting comments in my code. Like, get-out-of-my-skull good.
For the level at which I use it (individual file level or specific task level), I've found it to be a pretty good time saver. Nothing huge, but 5 minutes here and there probably has more psychic benefit than the actual time savings.
I still suck at bash, so it's nice to get a simple bash script to do something simple without having to search for it. Composing command-line API calls for one-off type stuff can be helpful too, or at least I'll give AI a shot before digging into the documentation.
But yeah, in some of the DevOps-type stuff there isn't a lot of material out there for some cases, in which case it's worse than useless.
__________________
Uncertainty is an uncomfortable position.
But certainty is an absurd one.
The Following User Says Thank You to photon For This Useful Post:
06-10-2025, 09:49 PM | #713
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
Quote:
"ChatGPT got absolutely wrecked on the beginner level," Caruso said in his LinkedIn post. "Despite being given a baseline board layout to identify pieces, ChatGPT confused rooks for bishops, missed pawn forks, and repeatedly lost track of where pieces were—first blaming the Atari icons as too abstract to recognize, then faring no better even after switching to standard chess notation."
https://www.extremetech.com/computin...-an-atari-2600
LOL. I love that it tried to make a human-like excuse instead of owning its sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job of emulating emotional responses than thinking. Which would be interesting, if we made an emotional bot before an intelligent one.
The Following User Says Thank You to Fuzz For This Useful Post:
06-11-2025, 10:37 AM | #714
Dances with Wolves
Join Date: Jun 2006
Location: Section 304
Quote:
Originally Posted by TorqueDog
Copilot has been good for the soft-ball tasks I've thrown at it (particularly since it can leverage corporation documentation).
However, I personally pay for ChatGPT Plus and it just did the same thing to me as it sounds like it did to you. I fed it three documents to use as its foundational basis for reviewing a fourth document, and it started making up clauses in the fourth document and flagged them as violations....
Out of curiosity, did you try this with o3 (base) or 4.1? I only ask because o3 seems far better at the more complicated high-stakes workflows and analysis, and 4.1 has a 1 million token context which could handle your documents better.
Apparently o3 pro (available at that $200/month tier) is blowing some pants off, but I don't have the money to try it out.
Quote:
Originally Posted by Fuzz
https://www.extremetech.com/computin...-an-atari-2600
LOL. I love that it tried to make a human-like excuse instead of owning it's sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job at emulating emotional responses than thinking. Which would be interesting if we made an emotional bot before an intelligent one.
These are always very funny comparisons, and frankly anything that keeps people out of the AI pool so I can continue to play gets a thumbs-up from me. But they're not really logical comparisons. It's a bit like saying that because ChatGPT can't count the number of R's in "Strawberry," it's not as useful as a dictionary. They're different things, and they don't operate the same way.
Last edited by Russic; 06-11-2025 at 10:40 AM.
06-11-2025, 10:55 AM | #715
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
Sure, I'm just saying that selling it as AI is vastly overselling it. I wish we could have kept the term AI for actual AI and used something else for LLMs.
I wonder how an LLM optimized for chess would work. Chess involves thinking a few moves ahead, but LLMs are typically next-token predictors, from what I understand. We also know they have very little spatial reasoning ability, which seems to make a chess board a challenge. But given their ability to handle millions of tokens, perhaps one could hold all possible game states, choosing the right one for each situation, and fundamentally "solve" chess.
I was actually more interested in the emotional responses it had, though. Do we want "AI" that makes excuses for its failures, even when it's proven the excuse was BS? That seems to reduce trust.
06-11-2025, 11:41 AM | #716
Franchise Player
Any thoughts on what concepts like "digital hoarding" and "digital garbage" may look like? With the invention of digital cameras/phone cameras, people easily accumulate 10-100K pieces of media over the years, versus maybe hundreds to a few thousand max over a lifetime when it was film. People don't look at most of them at all, but are often afraid to purge them.
Easy access to AI that does pre/post airbrushing, filters, etc. in this category alone will amplify output going forward.
06-11-2025, 11:56 AM | #717
Franchise Player
Join Date: Feb 2011
Location: Somewhere down the crazy river.
Quote:
Originally Posted by DoubleF
Any thoughts on a concept like, "digital hoarding" and "digital garbage" may look like? With the invention of digital cameras/phone cameras, people accumulate like 10-100K pieces of media on them over the years, easily vs maybe hundreds to a few thousand max over a lifetime when it was film. People don't look at most of them at all but are often afraid to purge them.
Easy access to AI that does pre/post air brush/filter etc. in this category alone will amplify output going forward.
I'm not sure exactly how Apple is doing it, but I have tons of photos on my phone, and I was able to search for one really easily instead of scrolling through thumbnails or trying to narrow down where the photo was.
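Apple doesn't document the internals, but semantic photo search is generally built on joint text-image embeddings: every photo is embedded once, and a text query is embedded into the same vector space and matched by similarity. A rough sketch of the idea using the open-source CLIP model via sentence-transformers; illustrative only, not Apple's actual pipeline.
Code:
# CLIP-style photo search: embed images and text into a shared vector
# space, then rank photos by cosine similarity to the query.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

photo_paths = ["beach.jpg", "receipt.png", "dog_park.jpg"]  # placeholder files
photo_embeddings = model.encode([Image.open(p) for p in photo_paths])

query_embedding = model.encode("a dog playing in the park")

scores = util.cos_sim(query_embedding, photo_embeddings)[0]
for path, score in sorted(zip(photo_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")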
06-11-2025, 12:08 PM | #718
Franchise Player
Join Date: Jul 2010
Location: Calgary - Centre West
Quote:
Originally Posted by Russic
Out of curiosity, did you try this with o3 (base) or 4.1? I only ask because o3 seems far better at the more complicated high-stakes workflows and analysis, and 4.1 has a 1 million token context which could handle your documents better.
Apparently o3 pro (available at that $200/month tier) is blowing some pants off, but I don't have the money to try it out.
Looks like it was running on good ol' GPT-4o, which is probably not great for this sort of thing. I'll have to give it another try using the other models; I didn't even think to run them through o3 or 4.1.
EDIT: Yup, WAY better on 4.1. It actually did what it was supposed to, with no hallucinations.
__________________
-James
GO FLAMES GO.
Quote:
Originally Posted by Azure
Typical dumb take.
Last edited by TorqueDog; 06-11-2025 at 12:27 PM.
The Following User Says Thank You to TorqueDog For This Useful Post:
06-11-2025, 02:20 PM | #719
#1 Goaltender
Quote:
Originally Posted by Fuzz
https://www.extremetech.com/computin...-an-atari-2600
LOL. I love that it tried to make a human-like excuse instead of owning it's sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job at emulating emotional responses than thinking. Which would be interesting if we made an emotional bot before an intelligent one.
Experiments like this annoy me, because ChatGPT-4o in its current iteration is quite dumb and meant, at best, for quick answers on the cheap. It's not a reasoning model, and it has a 128K-token context window max through the API (the ChatGPT version can be as low as 32K, or 8K on the free tier), so it would lose track of the board, or even of what it's doing, very quickly. Add to this that it used Atari 2600 images of the board (how can it even identify what the board is?), and it would be a hallucinating mess within a few messages.
In contrast, you have a computer program with set algorithms, built off training on games. Even an ancient program such as Video Chess on the Atari 2600 could beat your average chess player.
https://www.reddit.com/r/chess/comme...s_video_chess/
https://nanochess.org/video_chess.html
It may seem like a gotcha-type comparison, but it really isn't. This is simply not a good use case. To be honest, I don't know whether more advanced models would fare any better, but this is a weird headline-maker.
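If someone actually wanted an LLM in the loop, the obvious fix is to keep the authoritative board state in a chess library and restrict the model to legal moves, so it can never lose track of the position. A rough sketch with python-chess; llm_pick_move is a hypothetical stand-in for a real model call (here it just picks randomly).
Code:
# Keep authoritative board state in python-chess and constrain the
# "LLM" to legal moves so it cannot hallucinate the position.
import random
import chess

def llm_pick_move(fen: str, legal_moves: list[str]) -> str:
    """Hypothetical stand-in for a model call: given the position (FEN)
    and the legal moves in SAN, return one of them."""
    return random.choice(legal_moves)

board = chess.Board()
while not board.is_game_over():
    legal = [board.san(m) for m in board.legal_moves]
    choice = llm_pick_move(board.fen(), legal)
    if choice not in legal:  # reject any hallucinated move outright
        choice = random.choice(legal)
    board.push_san(choice)

print(board.result())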
Last edited by Firebot; 06-11-2025 at 02:24 PM.
06-11-2025, 03:52 PM | #720
Franchise Player
Join Date: Mar 2015
Location: Pickle Jar Lake
The quoted Bluesky post is essentially making the point of using the right tool for the job: while LLMs can be good at a lot of things, they're not good at all things. Which is probably a good message for clueless execs looking at deploying these tools because they hear they can do everything.