Calgarypuck Forums - The Unofficial Calgary Flames Fan Community

05-23-2025, 08:20 AM   #701
Russic

Several key happenings in the space:

Ethan Mollick provides an update on a study run at Harvard, Stanford, and other institutions comparing patient assessment by physicians vs. AI. They've updated the study to use o1-preview, and it's beginning to pull away. Also worth noting that studies move quite slowly, and we already have access to both o3 and what will become o4. (Note: there was no o2, thanks to the European telecom O2.)
https://x.com/emollick/status/1925362565946786206

It also appears that AI+instructor is giving children 1.5-2 years of advancement in only a few months; however, students who were unfamiliar with LLMs are more prone to use them as a crutch, damaging their learning.
https://x.com/emollick/status/1925055450254385592

In what might be the coolest thing I've seen in some time, this group used AI to create an environmentally friendly coolant.
https://x.com/vitrupo/status/1924568771353841999

05-23-2025, 08:36 AM   #702
Wormius

Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.

05-23-2025, 09:11 AM   #703
Russic

Quote:
Originally Posted by Wormius View Post
Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.
The more things change...

The advertising implications of these tools are off the charts. I can assure you that within no time at all you'll happen upon a webpage dynamically created to address your ultra-specific pain points. Whether that's more annoying than what we already have remains to be seen, I suppose.

05-23-2025, 09:29 AM   #704
Itse

Quote:
Originally Posted by Wormius View Post
Like many things that were once useful, I pessimistically see this as a way to deliver more advertising.
Advertising will be the least of it.

They will become a feeding tube of opinions and worldviews very quickly.

So far mostly unintentionally. So far.

06-09-2025, 10:28 PM   #705
Fuzz

https://bsky.app/profile/verybadllam.../3lr7odyhz7c2d

06-09-2025, 11:34 PM   #706
Wormius

I asked Copilot a question with some very specific parameters. It made up parameters that were not only incorrect, but didn't even match the ones in the spec sheet it said it sourced. I swore at it a bit and it apologized. I told it to re-do the search but not to lie again, and guess what!? It lied again and returned the same information with the made-up numbers.

I have no faith in this. It fails me every time I use it.

06-10-2025, 11:53 AM   #707
Russic

The gap between what Google should be able to do and what they regularly end up doing has got to be the biggest in the game. It's a bit odd given that they're Google, yet OpenAI routinely beats them at stuff like this. I suppose because it's free and available to everyone the model has to suck?

It's becoming clear that (for a while at least) there will be a massive difference between those who pay for the good models and those who don't (or can't).

06-10-2025, 12:05 PM   #708
Fuzz

You've got me curious. Since I don't pay, if you or anyone else does, can you put this question to ChatGPT: "Does Cape Breton have its own timezone?"

06-10-2025, 12:08 PM   #709
Fuzz

Never mind, chatgpt-4o-latest-20250326 gets it right. Same with grok-3-preview-02-24 and gemini-2.5-flash-preview-05-20.

06-10-2025, 12:35 PM   #710
Wormius

Quote:
Originally Posted by Russic View Post
The gap between what Google should be able to do and what they regularly end up doing has got to be the biggest in the game. It's a bit odd given that they're Google, yet OpenAI routinely beats them at stuff like this. I suppose because it's free and available to everyone the model has to suck?

It's becoming clear that (for a while at least) there will be a massive difference between those who pay for the good models and those who don't (or can't).

Also, if you're in a technical role where there isn't a lot of training data available (not everyone works in a tech industry that is open-source friendly, and the publications you'd need aren't free to access), the results are always horrible hallucinations. Yet managers push AI as this magic productivity booster…

06-10-2025, 02:29 PM   #711
TorqueDog

Quote:
Originally Posted by Wormius View Post
I asked Copilot a question with some very specific parameters. It made up parameters that were not only incorrect, but didn't even match the ones in the spec sheet it said it sourced. I swore at it a bit and it apologized. I told it to re-do the search but not to lie again, and guess what!? It lied again and returned the same information with the made-up numbers.

I have no faith in this. It fails me every time I use it.
Copilot has been good for the soft-ball tasks I've thrown at it (particularly since it can leverage corporate documentation).

However, I personally pay for ChatGPT Plus and it just did the same thing to me as it sounds like it did to you. I fed it three documents to use as its foundational basis for reviewing a fourth document, and it started making up clauses in the fourth document and flagged them as violations. I would insist that these clauses didn't exist, it would apologize, and then it would do it again.

I finally got sick of it making things up, started a new chat (deleted the old one too), and wrote some rules for it to follow whenever performing document analysis, since I've found it tends to behave when given tight guardrails:

1. Strict Clause Verification Rule: Only reference portions of text or clauses after directly locating them in the document through confirmable visible reading — no assumptions or projections.
2. Annotated Mode by Default: Provide exact paragraph, section, and page (where available) before offering any interpretation.
3. Reset-on-Upload Discipline: When the user instructs to forget a previously uploaded document, perform a full document context hard reset to prevent carryover errors.
4. Source Quotation Integrity Rule: Any interpretation must include the original quoted text and clarify if the interpretation is verbatim or inferred.
5. Chain-of-Reasoning Transparency: All conclusions must include a step-by-step justification.
6. Document Chain Anchoring: All citations and findings must trace back to the specific document and section.
7. Disclose Assumption Thresholds: If an assumption is made, explicitly flag it with a certainty rating and offer alternatives.
8. "No Silent Fixes" Policy: Never correct or smooth over errors silently; highlight issues explicitly and offer options.
9. Double-Pass Reviews: First pass is issue-flagging with exact quotes; second pass is interpretation only.
10. Deliberate Obstruction Checks: Evaluate how clauses might be challenged or weakened under dispute or scrutiny.
11. "What's Missing" Prompt Layer: Identify standard clauses or disclosures that are notably absent.
12. Comparative Clause Mapping: Where applicable, match clauses line-for-line across documents to reveal gaps or discrepancies.

Then I provided the foundational documents and instructed it to learn them, then provided the fourth document and asked it to find where its clauses violated provisions set forth in the first three.

ChatGPT proceeded to make up sections in the fourth document for its references once more. So I started from scratch AGAIN and provided the foundational documents, but this time I copied and pasted only specific portions from the fourth document for cross-checking against the first three, in case an issue with the OCR in the review document was causing problems. Nope: I checked its references against the foundational documents and found it was making things up from those PDFs, too.
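For anyone who'd rather script this than fight the chat UI, here's a rough sketch of how the same setup could go through the API. This is untested and full of assumptions: it uses the OpenAI Python SDK, a placeholder long-context model name, and made-up file names, and it presumes the PDFs have already been dumped to plain text.

```python
# Sketch only: the 12 rules ride along as the system prompt so they apply to
# every request. File names and model name below are hypothetical.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rules = Path("analysis_rules.txt").read_text()                   # the 12 rules above
foundational = [Path(f"foundational_{i}.txt").read_text() for i in (1, 2, 3)]
review_doc = Path("review_doc.txt").read_text()                  # the fourth document

messages = [
    {"role": "system", "content": rules},
    {"role": "user", "content": (
        "Foundational documents:\n\n" + "\n\n---\n\n".join(foundational)
        + "\n\nReview document:\n\n" + review_doc
        + "\n\nList every clause in the review document that violates a provision "
          "in the foundational documents. Quote the exact text and cite the section "
          "for each finding."
    )},
]

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; use whatever long-context model you actually have
    messages=messages,
)
print(response.choices[0].message.content)
```

Same idea as the chat workflow, except the rules can't silently fall out of context between uploads because they're re-sent with every request.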

06-10-2025, 02:36 PM   #712
photon

I will say the MS Copilot is creepily good at suggesting comments in my code. Like get out of my skull good.

For the level at which I use it (individual file level or specific task level) I've found it to be a pretty good time saver. Nothing huge, but 5 minutes here and there probably has more psychic benefit than the actual time savings.

I still suck at bash, so it's nice to get a simple bash script to do something simple without having to search for it. Composing command-line API calls for one-off type stuff can also be helpful; at the very least I'll give AI a shot before digging into the documentation.

But yeah, for some of the DevOps-type work there isn't much material out there, and in those cases it's worse than useless.

06-10-2025, 09:49 PM   #713
Fuzz

Quote:
"ChatGPT got absolutely wrecked on the beginner level," Caruso said in his LinkedIn post. "Despite being given a baseline board layout to identify pieces, ChatGPT confused rooks for bishops, missed pawn forks, and repeatedly lost track of where pieces were—first blaming the Atari icons as too abstract to recognize, then faring no better even after switching to standard chess notation."
https://www.extremetech.com/computin...-an-atari-2600


LOL. I love that it tried to make a human-like excuse instead of owning its sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job at emulating emotional responses than thinking. Which would be interesting, if we made an emotional bot before an intelligent one.

06-11-2025, 10:37 AM   #714
Russic

Quote:
Originally Posted by TorqueDog View Post
Copilot has been good for the soft-ball tasks I've thrown at it (particularly since it can leverage corporate documentation).

However, I personally pay for ChatGPT Plus and it just did the same thing to me as it sounds like it did to you. I fed it three documents to use as its foundational basis for reviewing a fourth document, and it started making up clauses in the fourth document and flagged them as violations....
Out of curiosity, did you try this with o3 (base) or 4.1? I only ask because o3 seems far better at the more complicated high-stakes workflows and analysis, and 4.1 has a 1 million token context which could handle your documents better.

Apparently o3 pro (available at that $200/month tier) is blowing some pants off, but I don't have the money to try it out.

Quote:
Originally Posted by Fuzz View Post
https://www.extremetech.com/computin...-an-atari-2600

LOL. I love that it tried to make a human-like excuse instead of owning its sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job at emulating emotional responses than thinking. Which would be interesting, if we made an emotional bot before an intelligent one.
These are always very funny comparisons, and frankly anything that keeps people out of the AI pool so that I can continue to play gets a thumbs-up from me. But it's not really a logical comparison. It's a bit like saying that because ChatGPT can't count the number of R's in "Strawberry" it's not as useful as a dictionary. They're different things, and they don't operate the same way.

06-11-2025, 10:55 AM   #715
Fuzz

Sure, I'm just saying that when it's sold as AI, that is vastly overselling it. I wish we could have kept the term AI for actual AI and used something else for LLMs.

I wonder how an LLM optimized for chess would work. Chess involves thinking a few moves ahead, but LLMs are typically next-token predictors, from what I understand. We also know they have very little spatial reasoning ability, which seems to make a chess board a challenge. But given their ability to handle millions of tokens, perhaps one could hold all possible game states, choosing the right one for each situation, and fundamentally "solve" chess.

I was actually more interested in the emotional responses it had, though. Do we want "AI" that makes excuses for its failures, even when it's been proven the excuse was BS? That seems to reduce trust.

06-11-2025, 11:41 AM   #716
DoubleF

Any thoughts on what concepts like "digital hoarding" and "digital garbage" may look like? With the advent of digital cameras and phone cameras, people easily accumulate 10-100K pieces of media over the years, versus maybe hundreds to a few thousand over a lifetime back when it was film. People don't look at most of them at all, but are often afraid to purge them.

Easy access to AI that does pre/post airbrushing, filters, etc. will amplify output in this category alone going forward.

06-11-2025, 11:56 AM   #717
Wormius

Quote:
Originally Posted by DoubleF View Post
Any thoughts on what concepts like "digital hoarding" and "digital garbage" may look like? With the advent of digital cameras and phone cameras, people easily accumulate 10-100K pieces of media over the years, versus maybe hundreds to a few thousand over a lifetime back when it was film. People don't look at most of them at all, but are often afraid to purge them.

Easy access to AI that does pre/post airbrushing, filters, etc. will amplify output in this category alone going forward.

I am not sure exactly how Apple is doing it, but I have tons of photos on my phone and I was able to search for one really easily instead of scrolling through thumbnails or trying to narrow down where the photo was.

06-11-2025, 12:08 PM   #718
TorqueDog

Quote:
Originally Posted by Russic View Post
Out of curiosity, did you try this with o3 (base) or 4.1? I only ask because o3 seems far better at the more complicated high-stakes workflows and analysis, and 4.1 has a 1 million token context which could handle your documents better.

Apparently o3 pro (available at that $200/month tier) is blowing some pants off, but I don't have the money to try it out.
Looks like it was running on good ol' GPT-4o, which is probably not great for this sort of thing. I'll have to give it another try using the other models; I didn't even think to run them through o3 or 4.1.

EDIT: Yup, WAY better on 4.1. It actually did what it was supposed to, with no hallucinations.

06-11-2025, 02:20 PM   #719
Firebot

Quote:
Originally Posted by Fuzz View Post
https://www.extremetech.com/computin...-an-atari-2600

LOL. I love that it tried to make a human-like excuse instead of owning its sucking. Wait, I was about to joke about how audacious it is to call it "AI" if it can't play chess, but perhaps it's doing a much better job at emulating emotional responses than thinking. Which would be interesting, if we made an emotional bot before an intelligent one.
Experiments like this annoy me, because ChatGPT-4o in its current iteration is quite dumb and meant for quick answers on the cheap at best. It's not a reasoning model, it has a 128k-token context window at most through the API (the ChatGPT version can be as low as 32K, or 8K on the free tier), and it would lose track of the board, or even of what it's doing, very quickly. Add to that the fact that it used Atari 2600 images of the board (how can it even identify what the board is?), and it would be a hallucinating mess within a few messages.

In contrast, you have a computer program with set algorithms built specifically to play the game. Even an ancient program such as Video Chess on the Atari 2600 could beat your average chess player out there.

https://www.reddit.com/r/chess/comme...s_video_chess/

https://nanochess.org/video_chess.html

It may seem like a gotcha-type comparison, but it really isn't; this is simply not a good use case. To be honest, I don't know whether other, more advanced models would fare any better, but it's a weird headline-maker.
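To illustrate the losing-track-of-the-board problem: about the only way a chat model gets through a game at all is if the harness re-feeds it the full position and the legal moves every single turn, and then validates whatever comes back. A rough, untested sketch of that kind of loop (it assumes the python-chess library and the OpenAI SDK; the model name is just a placeholder):

```python
# Sketch only: re-send the full board state each move so the model can't drift,
# and reject hallucinated moves. Both sides are played by the model here, purely
# to show the state-feeding loop.
import chess
from openai import OpenAI

client = OpenAI()
board = chess.Board()

while not board.is_game_over():
    legal = [board.san(m) for m in board.legal_moves]
    prompt = (
        f"Current position (FEN): {board.fen()}\n"
        f"Legal moves: {', '.join(legal)}\n"
        "Reply with exactly one legal move in SAN, nothing else."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()

    if reply in legal:
        board.push_san(reply)
    else:
        # Hallucinated or malformed move: fall back to the first legal move.
        board.push(next(iter(board.legal_moves)))

print(board.result())
```

Even then, all of the actual chess knowledge lives in the chess library and the prompt rather than in the model, which is kind of the point.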

06-11-2025, 03:52 PM   #720
Fuzz

The quoted Bluesky post is essentially making the point of using the right tool for the job: while LLMs can be good at a lot of things, they're not good at all things. Which is probably a good message for the clueless execs looking at deploying these tools because they hear they can do everything.